Final Project Submission

Please fill out:


Final Summary Figures for Presentation

Outline of Data Processing and Analysis(using OSEMN model)

  1. OBTAIN: Import data, inspect, check for datatypes to convert and null values

    • Display header and info
    • Drop any unneeded columns (df.drop(['col1','col2'],axis=1)
  2. SCRUB: cast data types, identify outliers, check for multicollinearity, normalize data

    • Check and cast data types
      • [x] Check for #'s that are store as objects (df.info())
        • when converting to #'s, look for odd values (like many 0's), or strings that can't be converted
        • Decide how to deal weird/null values (df.unique(), df.isna().sum(), df.describe()-min/max, etc
      • [x] Check for categorical variables stored as integers
    • [x] Check for missing values (df.isna().sum())
      • Can drop rows or colums
      • For missing numeric data with median or bin/convert to categorical
      • For missing categorical data: make NaN own category OR replace with most common category
    • [X] Check for multicollinearity
      • use seaborn to make correlation matrix plot Evernote Link
        • Good rule of thumb is anything over 0.75 corr is high, remove the variable that has the most correl with the largest # of variables
    • [x] Normalize data (may want to do after some exploring)
      • Most popular is Z-scoring (but won't fix skew)
      • Can log-transform to fix skewed data
  1. EXPLORE:Check distributions, outliers, etc
    • [x] Check scales, ranges (df.describe())
    • [x] Check histograms to get an idea of distributions (df.hist()) and dat transformations to perform
      • Can also do kernel density estimates
    • [x] Use scatterplots to check for linearity and possible categorical variables (df.plot(kind-'scatter')
      • categoricals will look like vertical lines
    • [x] Use pd.plotting.scatter_matrix to visualize possible relationships
    • [x] Check for linearity
  1. FIT AN INITIAL MODEL:
    • Various forms, detail later...
    • Assessing the model:
      • Assess parameters (slope,intercept)
      • Check if the model explains the variation in the data (RMSE, F, R_square)
      • Are the coeffs, slopes, intercepts in appropriate units?
      • Whats the impact of collinearity? Can we ignore?
  2. Revise the fitted model
    • Multicollinearity is big issue for lin regression and cannot fully remove it
    • Use the predictive ability of model to test it (like R2 and RMSE)
    • Check for missed non-linearity
  3. Holdout validation / Train/test split
    • use sklearn train_test_split


OBTAIN:

  • Import required packages, read in dataframe, and definefunctions
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
%matplotlib inline
inline_rc = dict(mpl.rcParams)

plt.style.use('seaborn')

import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import normaltest as normtest # D'Agostino and Pearson's omnibus test

from collections import Counter
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('kc_house_data.csv')   
# Set index, create dataframe for dropped variables with id as index for both
df.set_index=('id')

drop_me =['id','lat','long','date']
df_dropped = df[drop_me].copy()
df_dropped.set_index('id')
df.drop(drop_me,axis=1,inplace=True)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\matplotlib\__init__.py:886: MatplotlibDeprecationWarning: 
examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))
In [2]:
df.index
Out[2]:
RangeIndex(start=0, stop=21597, step=1)

Define functions to be used

def check_column(series, nlargest):

In [3]:
# Check columns returns the datatype, null values and unique values of input series 
def check_column(series,nlargest='all'):
    print(f"Column: df['{series.name}']':")
    print(f"dtype: {series.dtype}")
    print(f"isna: {series.isna().sum()} out of {len(series)} - {round(series.isna().sum()/len(series)*100,3)}%")
        
    print(f'\nUnique non-na values:') #,df['waterfront'].unique())
    if nlargest =='all':
        print(series.value_counts())
    else:
        print(series.value_counts().nlargest(nlargest))

def log_z(col):

  • Logs the column then z-scores
In [4]:
# define log + z-score
def log_z(col):
    
    col = np.array(col)
    logcol = np.log(col)
    
    zlogcol = (logcol-np.mean(logcol))/np.sqrt(np.var(logcol))
    
    return zlogcol

def rem_out_z(col_name):

  • Logs the column then z-scores and removes zscores>3
In [5]:
def rem_out_z(col_name):

    col = np.array(col_name)
    z_col = (col - np.mean(col)) / np.sqrt(np.var(col))
    z_col[abs(z_col)>3]=np.nan
    return z_col

def multiplot(df):

  • plots heat map of correlation values of input dataframe
In [6]:
# MULTIPLOT
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def multiplot(df):

    sns.set(style="white")

    # Compute the correlation matrix
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=np.bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(16, 16))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, annot=True, cmap=cmap, center=0,
                
    square=True, linewidths=.5, cbar_kws={"shrink": .5}) #

def plot_hist_scat(df,target,stats):

In [7]:
# Plots histogram and scatter (vs price) side by side
def plot_hist_scat(df,target='price',stats=False):
#     plt.style.use('bmh')
    plt.style.use('fivethirtyeight')
    
#     fontTitle = {'family': 'serif',
#             'color':  'black',
#             'weight': 'normal',
#             'size': 16,
#             }
#     fontTicks = {'family': 'sans-serif',
#             'color':  'black',
#             'weight': 'normal',
#             'size': 10,
#             }
    results = [['column','K_square','p-val']]

    for column in df.describe():

        fig = plt.figure(figsize=(8,3) )#plt.figaspect(0.5))#(5,4))
        
        ax1 = fig.add_subplot(121)
        ax1.hist(df[column],density=True,label = column+' histogram',bins=20)
        ax1.set_title(column.capitalize())

        ax1.legend()
        
        ax2 = fig.add_subplot(122)
        ax2.scatter(x=df[column], y=df[target],label = column+' vs price',marker='.')
        ax2.set_title(column.capitalize())
        ax2.legend()

        fig.tight_layout()
        if stats==True:
            stat, p = normtest(df[column])
#             print(f'Normality test for {column}:K_square = {stat}, p-value = {p}')

            results.append([column,stat, p])
    return pd.DataFrame(results)
In [ ]:
 

def plot_hist_scat_sns(df,target,stats):

In [8]:
#SEABORN
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt

# Plots histogram and scatter (vs price) side by side
def plot_hist_scat_sns(df,target='price'):
    plt.style.use('dark_background')

    
    ## ----------- DEFINE AESTHETIC CUSTOMIZATIONS ----------- ##
    # Axis Label fonts
    fontTitle = {'fontsize': 16,
               'fontweight': 'bold',
                'fontfamily':'serif'}

    fontAxis = {'fontsize': 14,
               'fontweight': 'bold',
                'fontfamily':'serif'}

    fontTicks = {'fontsize': 12,
               'fontweight':'bold',
                'fontfamily':'serif'}

    # Formatting dollar sign labels
    fmtPrice = '${x:,.0f}'
    tickPrice = mtick.StrMethodFormatter(fmtPrice)
    

    ## ----------- PLOTTING ----------- ##
    
    ## Loop through dataframe to plot
    for column in df.describe():
    
        # Create figure with subplots for current column
        # Note: in order to use identical syntax for large # of subplots (ax[i,j]), 
        #  declare an extra row of subplots to be removed later
        fig, ax = plt.subplots(figsize=(12,10), ncols=2, nrows=2)

        ## ----- SUBPLOT 1 -----##
        i,j = 0,0
        ax[i,j].set_title(column.capitalize(),fontdict=fontTitle)
        
        # Define graphing keyword dictionaries for distplot (Subplot 1)
        hist_kws = {"linewidth": 1, "alpha": 1, "color": 'blue','edgecolor':'w'}
        kde_kws = {"color": "white", "linewidth": 1, "label": "KDE"}
        
        # Plot distplot on ax[i,j] using hist_kws and kde_kws
        sns.distplot(df[column], norm_hist=True, kde=True,
                     hist_kws = hist_kws, kde_kws = kde_kws,
                     label=column+' histogram', ax=ax[i,j])
 

        # Set x axis label
        ax[i,j].set_xlabel(column.title(),fontdict=fontAxis)
    
        # Get x-ticks, rotate labels, and return
        xticklab1 = ax[i,j].get_xticklabels(which = 'both')
        ax[i,j].set_xticklabels(labels=xticklab1, fontdict=fontTicks, rotation=45)
        ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

        
        # Set y-label 
        ax[i,j].set_ylabel('Density',fontdict=fontAxis)
        yticklab1=ax[i,j].get_yticklabels(which='both')
        ax[i,j].set_yticklabels(labels=yticklab1,fontdict=fontTicks)
        ax[i,j].yaxis.set_major_formatter(mtick.ScalarFormatter())
        
        
        # Set y-grid
        ax[i, j].set_axisbelow(True)
        ax[i, j].grid(axis='y',ls='--')

        
        ## ----- SUBPLOT 2-----  ##
        i,j = 0,1
        ax[i,j].set_title(column.capitalize(),fontdict=fontTitle)

        # Define the ketword dictionaries for  scatter plot and regression line (subplot 2)
        line_kws={"color":"white","alpha":0.5,"lw":4,"ls":":"}
        scatter_kws={'s': 2, 'alpha': 0.5,'marker':'.','color':'blue'}

        # Plot regplot on ax[i,j] using line_kws and scatter_kws
        sns.regplot(df[column], df[target], 
                    line_kws = line_kws,
                    scatter_kws = scatter_kws,
                    ax=ax[i,j])
        
        # Set x-axis label
        ax[i,j].set_xlabel(column.title(),fontdict=fontAxis)

         # Get x ticks, rotate labels, and return
        xticklab2=ax[i,j].get_xticklabels(which='both')
        ax[i,j].set_xticklabels(labels=xticklab2,fontdict=fontTicks, rotation=45)
        ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

        # Set  y-axis label
        ax[i,j].set_ylabel('Price',fontdict=fontAxis)
        
        # Get, set, and format y-axis Price labels
        yticklab = ax[i,j].get_yticklabels()
        ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
        ax[i,j].get_yaxis().set_major_formatter(tickPrice) 

        # Set y-grid
        ax[i, j].set_axisbelow(True)
        ax[i, j].grid(axis='y',ls='--')       
        
        ## ---------- Final layout adjustments ----------- ##
        # Deleted unused subplots 
        fig.delaxes(ax[1,1])
        fig.delaxes(ax[1,0])

        # Optimizing spatial layout
        fig.tight_layout()
        figtitle=column+'_dist_regr_plots.png'
        plt.savefig(figtitle)
    return 

def detect_outliers(df,n,features): (using IQRs)

In [9]:
# Tukey's method using IQR to eliminate 
def detect_outliers(df,n,features):
    outlier_indices = []
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        # select observations containing more than 2 outliers
        outlier_indices = Counter(outlier_indices)        
        multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
        return multiple_outliers 
# Outliers_to_drop = detect_outliers(data,2,["col1","col2"])
# df.loc[Outliers_to_drop] # Show the outliers rows
# Drop outliers
# data= data.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)

SCRUB:

Initial inspection of dataframe, datatypes, and null values

In [10]:
df.head()
Out[10]:
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 1800 7503
In [11]:
# Display DataFrame Infro
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 17 columns):
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(6), int64(10), object(1)
memory usage: 2.8+ MB
In [12]:
# Check for columns with null values (remember strings/objects are not counted here)
res = df.isna().sum()
print(res[res>0],'\nTotal:',len(df))
# Waterfront, view, yr_renovated have missing values
waterfront      2376
view              63
yr_renovated    3842
dtype: int64 
Total: 21597

Fill in null values and recast variables for EDA

  • From examining the df.info and df.isna().sum(), there is 1 numerical data column that is currently text/object data type.
    • I will address these first since they would be excluded from preliminary visualizations.
  • There are also null values to address in waterfront, view, and yr_renovated
  • The Date column will be recast as datetime

zipcode

In [13]:
# Recast zipcode as a category
df['zipcode'] = df['zipcode'].astype('category')

sqft_basement

In [14]:
# RECASTING SQFT_BASEMENT
# Checking why sqft_basement might be an object:
# df['sqft_basement'].value_counts().nlargest(10)
check_column(df['sqft_basement'],10)
Column: df['sqft_basement']':
dtype: object
isna: 0 out of 21597 - 0.0%

Unique non-na values:
0.0       12826
?           454
600.0       217
500.0       209
700.0       208
800.0       201
400.0       184
1000.0      148
300.0       142
900.0       142
Name: sqft_basement, dtype: int64
  • sqft_basement is current an object, needs to be converted to int
    • Need to replace drop the 454 '?' values
    • There are a lot of 0's, for sqft_basement. Not sure if I should keep them in the dataset. I am for now.
    • Note: I originally replaced the ?'s with 0's, but am re-running the dataset with them dropped altogether.
In [15]:
# Removing rows with ? for sqft_basement and converting to float
print(len(df))
df['sqft_basement'].replace(to_replace = '?',value=np.nan,inplace=True) #,inplace=True)
df.dropna(axis=0,subset=['sqft_basement'],inplace=True)
df['sqft_basement'] = df['sqft_basement'].astype('float')
df['sqft_basement'].isna().sum()
check_column(df['sqft_basement'],10)
21597
Column: df['sqft_basement']':
dtype: float64
isna: 0 out of 21143 - 0.0%

Unique non-na values:
0.0       12826
600.0       217
500.0       209
700.0       208
800.0       201
400.0       184
1000.0      148
900.0       142
300.0       142
200.0       105
Name: sqft_basement, dtype: int64
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21143 entries, 0 to 21596
Data columns (total 17 columns):
price            21143 non-null float64
bedrooms         21143 non-null int64
bathrooms        21143 non-null float64
sqft_living      21143 non-null int64
sqft_lot         21143 non-null int64
floors           21143 non-null float64
waterfront       18804 non-null float64
view             21082 non-null float64
condition        21143 non-null int64
grade            21143 non-null int64
sqft_above       21143 non-null int64
sqft_basement    21143 non-null float64
yr_built         21143 non-null int64
yr_renovated     17389 non-null float64
zipcode          21143 non-null category
sqft_living15    21143 non-null int64
sqft_lot15       21143 non-null int64
dtypes: category(1), float64(7), int64(9)
memory usage: 2.8 MB

View

  • Replace the 61 null values with appropriate value for data type
  • Drop the 61 view null values
In [17]:
# Check for columns with null values (remember strings/objects are not counted here)
res = df.isna().sum()
print(res[res>0])
# Waterfront, view, yr_renovated have missing values
waterfront      2339
view              61
yr_renovated    3754
dtype: int64
In [18]:
check_column(df['view'])
Column: df['view']':
dtype: float64
isna: 61 out of 21143 - 0.289%

Unique non-na values:
0.0    19018
2.0      930
3.0      496
1.0      327
4.0      311
Name: view, dtype: int64
In [19]:
# Drop null values from view and re-check column
print(len(df))
df.dropna(axis=0,subset=['view'],inplace=True)
print(len(df))
check_column(df['view'])
21143
21082
Column: df['view']':
dtype: float64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0.0    19018
2.0      930
3.0      496
1.0      327
4.0      311
Name: view, dtype: int64
In [20]:
# Convert view to category and create a coded version called code_view
df['view'] = df['view'].astype('category')
df['code_view'] = df.view.cat.codes
df['code_view'].value_counts()
Out[20]:
0    19018
2      930
3      496
1      327
4      311
Name: code_view, dtype: int64
In [21]:
# Drop drop_me variable from main df, save in df_dropped
drop_me = 'view'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)
In [22]:
#Recheck whole df
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price            21082 non-null float64
bedrooms         21082 non-null int64
bathrooms        21082 non-null float64
sqft_living      21082 non-null int64
sqft_lot         21082 non-null int64
floors           21082 non-null float64
waterfront       18749 non-null float64
condition        21082 non-null int64
grade            21082 non-null int64
sqft_above       21082 non-null int64
sqft_basement    21082 non-null float64
yr_built         21082 non-null int64
yr_renovated     17340 non-null float64
zipcode          21082 non-null category
sqft_living15    21082 non-null int64
sqft_lot15       21082 non-null int64
code_view        21082 non-null int8
dtypes: category(1), float64(6), int64(9), int8(1)
memory usage: 2.6 MB

Waterfront

  • Recast as int string
  • Waterfront has 2339 null values to replace. make into unordered and coded into "NaN" category
    • Note this is string "NaN" not np.nan
In [23]:
check_column(df['waterfront'])
Column: df['waterfront']':
dtype: float64
isna: 2333 out of 21082 - 11.066%

Unique non-na values:
0.0    18608
1.0      141
Name: waterfront, dtype: int64
In [24]:
# Convert waterfront to category, replace null values with "NaN"' string to make it a category
df['waterfront'] = df['waterfront'].astype('category')
df['waterfront'].replace(np.nan,"NaN",inplace=True)
df['waterfront'] = df['waterfront'].astype('category')
In [25]:
# df['waterfront'].value_counts()
check_column(df['waterfront'],10)
Column: df['waterfront']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0.0    18608
NaN     2333
1.0      141
Name: waterfront, dtype: int64
In [26]:
# Turn waterfront into coded column 
df['code_waterfront'] = df.waterfront.cat.codes
check_column(df['code_waterfront'])
Column: df['code_waterfront']':
dtype: int8
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0    18608
2     2333
1      141
Name: code_waterfront, dtype: int64
In [27]:
# Drop drop_me variable from main df, save in df_dropped
drop_me = 'waterfront'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)
In [28]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price              21082 non-null float64
bedrooms           21082 non-null int64
bathrooms          21082 non-null float64
sqft_living        21082 non-null int64
sqft_lot           21082 non-null int64
floors             21082 non-null float64
condition          21082 non-null int64
grade              21082 non-null int64
sqft_above         21082 non-null int64
sqft_basement      21082 non-null float64
yr_built           21082 non-null int64
yr_renovated       17340 non-null float64
zipcode            21082 non-null category
sqft_living15      21082 non-null int64
sqft_lot15         21082 non-null int64
code_view          21082 non-null int8
code_waterfront    21082 non-null int8
dtypes: category(1), float64(5), int64(9), int8(2)
memory usage: 2.5 MB

yr_renovated --> is_reno

  • yr_renovated has 3754 null values
    • Presumably this is because the homes were never renovated
    • Will replace with 0.0
    • Replace all 0.0's with NaN
  • Replacing yr_renovated with binary category based on it if has been renovated or not
    • Make new column 'is_reno'
Make is_reno category (0 or 1)
In [29]:
check_column(df['yr_renovated'],10)
Column: df['yr_renovated']':
dtype: float64
isna: 3742 out of 21082 - 17.75%

Unique non-na values:
0.0       16618
2014.0       69
2003.0       31
2013.0       31
2007.0       30
2000.0       29
2005.0       29
1990.0       22
2009.0       21
2004.0       21
Name: yr_renovated, dtype: int64
In [30]:
# Fill in 0.0 yr_renovated as np.nan temporarily
df['yr_renovated'].replace(0.0,np.nan,inplace=True)

# Recheck for null values
check_column(df['yr_renovated'],10) #df.isna().sum()
Column: df['yr_renovated']':
dtype: float64
isna: 20360 out of 21082 - 96.575%

Unique non-na values:
2014.0    69
2013.0    31
2003.0    31
2007.0    30
2000.0    29
2005.0    29
1990.0    22
2009.0    21
2004.0    21
2006.0    20
Name: yr_renovated, dtype: int64
In [31]:
# Get indices of all np.nan values to be used for is_reno category below
idx_nan = df['yr_renovated'].loc[df['yr_renovated'].isna()].index
idx_notnan = df['yr_renovated'].loc[~df['yr_renovated'].isna()].index

# Make df['is_reno']
df['is_reno'] = df['yr_renovated'].copy
df['is_reno'][idx_nan] = 0
df['is_reno'][idx_notnan] = 1

check_column(df['is_reno'],10) 
Column: df['is_reno']':
dtype: int64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0    20360
1      722
Name: is_reno, dtype: int64
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [32]:
# Drop drop_me variable from main df, save in df_dropped
drop_me = 'yr_renovated'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

SCRUBBING THUS FAR...

  • Removed null values by dropping na from sqft_basement and view.
  • Converted waterfront to category and made NaN its own separate category (since there were so many null vlaues)
  • Converted yr_renovated to is_reno simple 0 or 1 value
  • Recase zipcodes as category since there is no numerical relationship between zipcode values
  • Next to inspect distributions and scatterplots to identify which numerical columns may be better analyzed as categories
In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price              21082 non-null float64
bedrooms           21082 non-null int64
bathrooms          21082 non-null float64
sqft_living        21082 non-null int64
sqft_lot           21082 non-null int64
floors             21082 non-null float64
condition          21082 non-null int64
grade              21082 non-null int64
sqft_above         21082 non-null int64
sqft_basement      21082 non-null float64
yr_built           21082 non-null int64
zipcode            21082 non-null category
sqft_living15      21082 non-null int64
sqft_lot15         21082 non-null int64
code_view          21082 non-null int8
code_waterfront    21082 non-null int8
is_reno            21082 non-null int64
dtypes: category(1), float64(4), int64(10), int8(2)
memory usage: 3.1 MB

Question 1: Which predictors should be analyzed as continuous data, vs binned/categorical data?

Identifying numerical data better analyzed as categorical

  • Will examine histogram distributions and scatter plots vs price for each variable in df.describe()
In [34]:
# Plot histograms and scatterplots vs target variable price for all numerical columns in df (show up in .describe())

fig = plot_hist_scat(df)
# fig = plot_hist_scat_sns(df)

Answer 1: How to treat each variable

  • Notes on histograms and scatterplots #### First, to comment on scatter plots that are indicative of categorical data:
  • Columns to be analyzed as categorical data: (cast as int for now, make sure its .astype('category').cat.as_ordered() one-hot coded later before running regression
    • Waterfront
    • View
    • Zipcode
  • Columns that are ordinal data. (e.g. 2 floors is indeed double 1 floor)
    • Floors
    • Bedrooms
    • Bathrooms
    • Condition
    • Grade
  • Columns that may be best analyzed by binning and casting as categorical data
    • Yr_built
    • is_reno
      • classified as simply renovated or not.**
    • sqft_basement
    • sqft_above
  • Numerical columns (that may be best analyzed as such)
    • All sqft categories
    • price
    • Note: moved sqft_basement to binned category to deal with 0's, also added sqft_above to accompany it
  • Numerical columns that were dropped
    • id
    • Lat
    • Long
    • Date
    • yr_renovated -> is_reno

Second, to comment on distributions

  • The target variable, price, seems a bit skewed and may be better analyzed as log-transformed.
    • Try both log-transformed and unaltered
  • All sqft columns seem to be skewed and should be transformed. (log)

List of vartypes/names

# Create tuples with columns names of categorical variables for easy extraction 
cat_vars = ('waterfront','view','zipcode')
ord_vars = ('grade','condition','floors','bedrooms','bathrooms')
vars_to_bin = ('yr_built','yr_renovated','sqft_above','sqft_basement')
num_vars = ('sqft_living', 'sqft_lot','sqft_living15', 'sqft_lot15')

COARSE-BINNING Numerical Data

  • yr_built, yr_renovated
  • Added sqft_basement due to 0 values
  • Added sqft_above to accompany basement

binning yr_built

In [35]:
# Check the min and max of the yr variables for binning range
df['yr_built'].describe()
Out[35]:
count    21082.000000
mean      1971.023337
std         29.323120
min       1900.000000
25%       1952.000000
50%       1975.000000
75%       1997.000000
max       2015.000000
Name: yr_built, dtype: float64
In [36]:
## Bin yr_built then make yr_built category
# yr_built min is 1900, max is 2015
bins = list(range(1900,2030,10))

df['yr_built'].replace(np.nan,0,inplace=True)
bins_yrbuilt = pd.cut(df['yr_built'], bins,include_lowest=True) # Cut into bins
# check_column(bins_yrbuilt)

df['bins_yrbuilt'] = bins_yrbuilt.astype('category').cat.as_ordered() #.cat.as_ordered()

# Inspecting the binned data counts
check_column(df['bins_yrbuilt'])
Column: df['bins_yrbuilt']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(2000.0, 2010.0]      3354
(1960.0, 1970.0]      2484
(1950.0, 1960.0]      2411
(1970.0, 1980.0]      2351
(1980.0, 1990.0]      2297
(1990.0, 2000.0]      2085
(1940.0, 1950.0]      1806
(1920.0, 1930.0]      1142
(2010.0, 2020.0]      1066
(1899.999, 1910.0]     759
(1910.0, 1920.0]       747
(1930.0, 1940.0]       580
Name: bins_yrbuilt, dtype: int64
In [37]:
# Drop original 
drop_me = 'yr_built'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

binning sqft_basement

In [38]:
df['sqft_basement'].describe()
Out[38]:
count    21082.000000
mean       291.359975
std        442.007858
min          0.000000
25%          0.000000
50%          0.000000
75%        560.000000
max       4820.000000
Name: sqft_basement, dtype: float64
In [39]:
check_column(df['sqft_basement'],10)
Column: df['sqft_basement']':
dtype: float64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0.0       12798
600.0       216
500.0       209
700.0       207
800.0       201
400.0       184
1000.0      147
900.0       142
300.0       140
200.0       105
Name: sqft_basement, dtype: int64
In [40]:
# Definine bins where bins 0-1 is its own interval
bins=[-np.inf]
[bins.append(x) for x in range(1,5500,500)]
# bins

# cut_basement = df['sqft_basement'].replace(0,np.nan)
cut_basement = df['sqft_basement'].replace(np.nan,0).copy()
# cut_basement = cut_basement.replace('NaN',0)

bins_sqftbase = pd.cut(cut_basement, bins=bins, include_lowest=True) # Cut into bins
df['bins_sqftbasement'] = bins_sqftbase.copy()

# Cast as ordered category
df['bins_sqftbasement'] = df['bins_sqftbasement'].astype('category').cat.as_ordered()

# Check result
check_column(df['bins_sqftbasement'])
Column: df['bins_sqftbasement']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(-inf, 1.0]         12798
(501.0, 1001.0]      3849
(1.0, 501.0]         2617
(1001.0, 1501.0]     1476
(1501.0, 2001.0]      269
(2001.0, 2501.0]       57
(2501.0, 3001.0]       11
(3001.0, 3501.0]        3
(4501.0, 5001.0]        1
(4001.0, 4501.0]        1
(3501.0, 4001.0]        0
Name: bins_sqftbasement, dtype: int64
In [41]:
# Drop original 
drop_me = 'sqft_basement'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

binning sqft_above

In [42]:
check_column(df['sqft_above'],10)
df['sqft_above'].describe()
Column: df['sqft_above']':
dtype: int64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
1300    208
1010    206
1200    197
1220    190
1140    181
1180    174
1400    174
1340    173
1060    173
1250    168
Name: sqft_above, dtype: int64
Out[42]:
count    21082.000000
mean      1789.000000
std        828.369355
min        370.000000
25%       1200.000000
50%       1560.000000
75%       2210.000000
max       9410.000000
Name: sqft_above, dtype: float64
In [43]:
# sqft_above 
# Bins to cover range seen above in .describe
bins = list(range(0,9501,500))

# cut_above = df['sqft_above'].replace(0,np.nan)
bins_sqftabove = pd.cut(df['sqft_above'], bins=bins, include_lowest=True) # Cut into bins, including left edge 
check_column(bins_sqftabove)
Column: df['sqft_above']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(1000.0, 1500.0]    7352
(1500.0, 2000.0]    4686
(2000.0, 2500.0]    2764
(500.0, 1000.0]     2488
(2500.0, 3000.0]    1815
(3000.0, 3500.0]    1088
(3500.0, 4000.0]     466
(4000.0, 4500.0]     239
(4500.0, 5000.0]      80
(5000.0, 5500.0]      42
(-0.001, 500.0]       19
(6000.0, 6500.0]      16
(5500.0, 6000.0]      14
(6500.0, 7000.0]       4
(7500.0, 8000.0]       3
(8500.0, 9000.0]       2
(7000.0, 7500.0]       2
(8000.0, 8500.0]       1
(9000.0, 9500.0]       1
Name: sqft_above, dtype: int64
In [44]:
bins_sqftabove.replace(np.nan,'NaN',inplace=True)
df['bins_sqftabove']=bins_sqftabove.astype('category').cat.as_ordered()

check_column(df['bins_sqftabove'])
Column: df['bins_sqftabove']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(1000.0, 1500.0]    7352
(1500.0, 2000.0]    4686
(2000.0, 2500.0]    2764
(500.0, 1000.0]     2488
(2500.0, 3000.0]    1815
(3000.0, 3500.0]    1088
(3500.0, 4000.0]     466
(4000.0, 4500.0]     239
(4500.0, 5000.0]      80
(5000.0, 5500.0]      42
(-0.001, 500.0]       19
(6000.0, 6500.0]      16
(5500.0, 6000.0]      14
(6500.0, 7000.0]       4
(7500.0, 8000.0]       3
(8500.0, 9000.0]       2
(7000.0, 7500.0]       2
(8000.0, 8500.0]       1
(9000.0, 9500.0]       1
Name: bins_sqftabove, dtype: int64
In [45]:
# Drop original 
drop_me = 'sqft_above'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

CHECKING FOR MULTICOLLINEARITY

In [46]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [47]:
# to_drop_copy = ['sqft_above','sqft_basement','yr_built','waterfront','view']
# df_dropped = df[to_drop_copy].copy()
# # df.drop(to_drop_copy,axis=1,inplace=True)
In [48]:
# Plot correlation  heatmaps for all data 
# pause
to_drop = ['price']
multiplot(df.drop(to_drop,axis=1))
In [49]:
# # Test which multicol variables to keep
# pause
# df[['price','sqft_living','sqft_living15','sqft_above','sqft_basement']].corr()

Answer 2:

  • Sqft_living is highly correlated with sqft_living15 and grade,
  • These correlations make sense sense since neighborhoods probably have similar construction.
    • The r values are ~0.75 (threshold) and have sufficient intuitive rationale to keep.
In [50]:
# # Save copy of data in current form.
# df.to_csv('scrubbed_data.csv')
# df_dropped.to_csv('dropped_data.csv')

EXPLORE:

  • Re-examining the basic stats and distributions of the data.
  • Decide on transformations to perform
    • Normalize afterwards.
  • Visually examine for possible relationships

EDA before normalization/transformation

Examine basic descriptive stats

In [51]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [52]:
df.head()
Out[52]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade zipcode sqft_living15 sqft_lot15 code_view code_waterfront is_reno bins_yrbuilt bins_sqftbasement bins_sqftabove
0 221900.0 3 1.00 1180 5650 1.0 3 7 98178 1340 5650 0 2 0 (1950.0, 1960.0] (-inf, 1.0] (1000.0, 1500.0]
1 538000.0 3 2.25 2570 7242 2.0 3 7 98125 1690 7639 0 0 1 (1950.0, 1960.0] (1.0, 501.0] (2000.0, 2500.0]
2 180000.0 2 1.00 770 10000 1.0 3 6 98028 2720 8062 0 0 0 (1930.0, 1940.0] (-inf, 1.0] (500.0, 1000.0]
3 604000.0 4 3.00 1960 5000 1.0 5 7 98136 1360 5000 0 0 0 (1960.0, 1970.0] (501.0, 1001.0] (1000.0, 1500.0]
4 510000.0 3 2.00 1680 8080 1.0 3 8 98074 1800 7503 0 0 0 (1980.0, 1990.0] (-inf, 1.0] (1500.0, 2000.0]
In [53]:
df.describe()
Out[53]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 code_view code_waterfront is_reno
count 2.108200e+04 21082.000000 21082.000000 21082.000000 2.108200e+04 21082.00000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000
mean 5.402469e+05 3.372403 2.115916 2080.359975 1.507759e+04 1.49362 3.409828 7.657717 1986.917418 12732.514135 0.233327 0.228014 0.034247
std 3.667323e+05 0.924996 0.768142 917.856396 4.117338e+04 0.53937 0.650597 1.173690 685.544250 27148.781580 0.765066 0.630372 0.181868
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.00000 1.000000 3.000000 399.000000 651.000000 0.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.00000 3.000000 7.000000 1490.000000 5100.000000 0.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.620000e+03 1.50000 3.000000 7.000000 1840.000000 7626.000000 0.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.069775e+04 2.00000 4.000000 8.000000 2360.000000 10088.750000 0.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.50000 5.000000 13.000000 6210.000000 871200.000000 4.000000 2.000000 1.000000

Notes on basic statistics

  • Bedrooms has some very clear outliers (max is 33, but 75% quartile is only 4)
    • May want to remove outliers after Z-scoring (Absolute Z-score > 3)
  • Same with bathrooms (8 is max, 75% quartile is only 2.5)
  • Same with sqft_living (max 13540, 75% quartile = 2550)
  • Also same with sqft_lot15, sqftliving15

Visualizing numerical data

  • Distributions and scatterplots
  • Note: May want to cast all categoricals as strings/categories for visualization
    • Keeping as is for now
In [54]:
# Separate out numerical 
df.describe()
Out[54]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 code_view code_waterfront is_reno
count 2.108200e+04 21082.000000 21082.000000 21082.000000 2.108200e+04 21082.00000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000
mean 5.402469e+05 3.372403 2.115916 2080.359975 1.507759e+04 1.49362 3.409828 7.657717 1986.917418 12732.514135 0.233327 0.228014 0.034247
std 3.667323e+05 0.924996 0.768142 917.856396 4.117338e+04 0.53937 0.650597 1.173690 685.544250 27148.781580 0.765066 0.630372 0.181868
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.00000 1.000000 3.000000 399.000000 651.000000 0.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.00000 3.000000 7.000000 1490.000000 5100.000000 0.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.620000e+03 1.50000 3.000000 7.000000 1840.000000 7626.000000 0.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.069775e+04 2.00000 4.000000 8.000000 2360.000000 10088.750000 0.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.50000 5.000000 13.000000 6210.000000 871200.000000 4.000000 2.000000 1.000000
In [55]:
# INSPECTING NUMERICAL DATA DISTPLOTS + KDE
plt.style.use('seaborn')
plot_hist_scat_sns(df)
# for column in df.describe().columns:
    
#     sns.distplot(df[column],kde=True,label = column+' histogram',norm_hist=True)
#     title = column+' histogram & KDE'
#     plt.title(title.title())
#     plt.ylabel('Density')
#     plt.legend()
#     plt.show()

# Uncomment below if want new scatter_matrix
# fig = pd.plotting.scatter_matrix(df_clean,figsize=(12,12))

Notes on distplots

  • Essentialy all numerical data seems to be at least slightly skewed.
    • Do not think it is sufficient to log-transform the data and lose model interpretability

Visualizing categorical data

In [56]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [57]:
# Define list of all categorical variables 
list_cat_vars = ['zipcode', 'bins_yrbuilt', 'bins_sqftbasement', 'bins_sqftabove','condition','grade','code_view','code_waterfront']
In [58]:
# Plotting scatterplots and boxplots for categorical data
plt.style.use('seaborn')
for column in list_cat_vars:
   
    fig = plt.figure(figsize=(12,4))

    ax1  = fig.add_subplot(121)
    ax1 = sns.stripplot(x=df[column],y=df['price'],marker='.') 
    # ax1.set_title('Price vs ',column,' scatter plot')
    title1=column+' scatter'
    ax1.set_title(title1.title())
    ax1.set_xlabel(column)
    ax1.set_ylabel('Price')

    
    ax2 = fig.add_subplot(122)
    ax2 = sns.boxplot(x=df[column],y=df['price']) 
    title2=column+' boxplot'
    ax2.set_title(title2.title())
    ax2.set_xlabel(column)
    ax2.set_ylabel('Price')
    fig.tight_layout()

Question/Answer 3A: Which categorical variables show the greatest potential as predictors?

ANSWER 3A:

Notes on categorical scatter plots

  • grade seems to be strongly related to price (notice how the whole range of values seems to increase in price at higher grades.
  • Zipcodes look to differ quite a bit in terms of price.
  • Condition looks to be highly related to price (at least above condition level 2)
  • View does not look as strongly related to price as I expected.
  • Floors do not seem as related as expected
  • yr_built may have some relationship with price
  • yr_renovated has somewhat of a trend, but recent renovations buck this trend
In [59]:
# # INSPECTING REGRESSION PLOTS
# plt.style.use('seaborn')

# plot_vars=df.describe().columns

# for column in plot_vars:
# #     df_plot=df[column]
# #     df_plot = df.loc[df[column]>0]
#     plot= sns.regplot(df[column], df['price'],robust=False,marker='.') #kde=True,label = column+' histogram')
# #     plot = sns.boxplot(df[column],df['price'])
#     title = column+' linear regression'
#     plt.title(title.title())
# #     plt.legend()
#     plt.show()
In [60]:
df.describe()
Out[60]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 code_view code_waterfront is_reno
count 2.108200e+04 21082.000000 21082.000000 21082.000000 2.108200e+04 21082.00000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000
mean 5.402469e+05 3.372403 2.115916 2080.359975 1.507759e+04 1.49362 3.409828 7.657717 1986.917418 12732.514135 0.233327 0.228014 0.034247
std 3.667323e+05 0.924996 0.768142 917.856396 4.117338e+04 0.53937 0.650597 1.173690 685.544250 27148.781580 0.765066 0.630372 0.181868
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.00000 1.000000 3.000000 399.000000 651.000000 0.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.00000 3.000000 7.000000 1490.000000 5100.000000 0.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.620000e+03 1.50000 3.000000 7.000000 1840.000000 7626.000000 0.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.069775e+04 2.00000 4.000000 8.000000 2360.000000 10088.750000 0.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.50000 5.000000 13.000000 6210.000000 871200.000000 4.000000 2.000000 1.000000

[SCRUB-2] NORMALIZING & TRANSFORMING

Graphing raw vs normalized results to decide if dataset should be normalized

  • log-transform and z-score numerical data.
  • Am using detect_outlier function to apply Tukey's method for outlier remnoval based on IQR

Outlier Removal - visualizing

  • Will be using pre-defined function detect_outliers(df,n,var_name)
    • Returns index of rows containing outliers based in IQR
In [61]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [62]:
# Define variables to plot vs tukey-cleaned outliers 
vars_to_norm = ['sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms']
df.describe().columns
Out[62]:
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'condition', 'grade', 'sqft_living15', 'sqft_lot15', 'code_view',
       'code_waterfront', 'is_reno'],
      dtype='object')
In [63]:
from collections import Counter
from sklearn.preprocessing import RobustScaler
robscaler = RobustScaler()

# vars_to_norm = ['sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms']
norm_results = [['column','K_square','p-val']]

# Graph all potential normalizedvariables
for var_name in df.describe():

    var = df[var_name]
    fig = plt.figure(figsize=(12,4))
    ax1 = fig.add_subplot(121)
    ax1 = sns.distplot(var, norm_hist=True) 
    ax1.set_title('Raw '+var_name)
    #robscaler.fit_transform(np.array(var).reshape(-1,1)
    
    # OUTLIER REMOVAL
    outliers_to_drop = detect_outliers(df,0,[var_name])
    var_clean =df[var_name].drop(outliers_to_drop)
    
    ax2 = fig.add_subplot(122)
    
    ax2 = sns.distplot(var_clean,norm_hist=True)
#     ax2 = sns.distplot(robscaler.fit_transform(np.array(var_clean).reshape(-1,1)),norm_hist=True)
    
    ax2.set_title('Tukey Outliers Removed '+var_name) #+var)
    ax2.set_xlabel('Scale')
    stat, p = normtest(var_clean)
#     norm_results.append([var_clean,stat, p])
# norm_results = pd.DataFrame(norm_results[2:],columns=norm_results[0])#,index='columns')
# norm_results
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\nonparametric\kde.py:488: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\numpy\core\fromnumeric.py:83: RuntimeWarning: invalid value encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Question/Answer 3B: Does removal of outliers improve the distrubtions?

  • The data is skewed by outliers.
    • Comparing it to IQR-method outlier elimination reveals much improved results
    • The distributions look much more reasonabile with outliers removed.

REMOVING OUTLIERS

Recall stats observations from beginning of EDA:

Notes on basic statistics

  • Bedrooms has some very clear outliers (max is 33, but 75% quartile is only 4)
    • May want to remove outliers after Z-scoring (Absolute Z-score > 3)
  • Same with bathrooms (8 is max, 75% quartile is only 2.5)
  • Same with sqft_living (max 13540, 75% quartile = 2550)
  • Also same with sqft_lot15, sqftliving15
In [64]:
# DEFINE VARIABLES TO GET THE OUTLIERS FOR (based on observations)
# vars_to_norm = ['price','bedrooms''sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms'
vars_to_clean = ['price','bedrooms','sqft_living','bathrooms','sqft_living15']
In [65]:
# GET OUTLIER INDICES AND REPORT 

outliers_to_drop = {}

for col in vars_to_clean:
    outliers_to_drop[col] = detect_outliers(df,0,[col])
# outliers_to_drop.keys()
# outliers_to_drop.values()

# Print out # of outliers
for k, v in outliers_to_drop.items():
    print(f'col: {k} has {len(v)} outliers. ({round(len(v)/len(df),2)*100}%)' )
col: price has 1131 outliers. (5.0%)
col: bedrooms has 511 outliers. (2.0%)
col: sqft_living has 557 outliers. (3.0%)
col: bathrooms has 546 outliers. (3.0%)
col: sqft_living15 has 532 outliers. (3.0%)

Filling in df_norm

In [66]:
# Intialize df_norm with df's values
df_norm=df.copy()
In [67]:
# Iterate throught outliers_to_drop dictionary to replace outliers with np.nan
for k, v in outliers_to_drop.items():
    df_norm.loc[v,k] = np.nan # axis=0,inplace=True)

# Display null values
df_norm.isna().sum()
Out[67]:
price                1131
bedrooms              511
bathrooms             546
sqft_living           557
sqft_lot                0
floors                  0
condition               0
grade                   0
zipcode                 0
sqft_living15         532
sqft_lot15              0
code_view               0
code_waterfront         0
is_reno                 0
bins_yrbuilt            0
bins_sqftbasement       0
bins_sqftabove          0
dtype: int64
In [68]:
# REMOVING OUTLIERS FROM PRICE
drop_col = 'price'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=['price'],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for price:
# of intial rows: 21082
# after dropping rows: 19951

Outliers remaining: 
price                  0
bedrooms             449
bathrooms            253
sqft_living          169
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        231
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [69]:
# REMOVING OUTLIERS FROM BEDROOMS
drop_col = 'bedrooms'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for bedrooms:
# of intial rows: 19951
# after dropping rows: 19502

Outliers remaining: 
price                  0
bedrooms               0
bathrooms            196
sqft_living          150
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        228
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [70]:
# REMOVING OUTLIERS FROM PRICE

drop_col = 'sqft_living'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for sqft_living:
# of intial rows: 19502
# after dropping rows: 19352

Outliers remaining: 
price                  0
bedrooms               0
bathrooms            147
sqft_living            0
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        176
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [71]:
# REMOVING OUTLIERS FROM BATHROOMS
drop_col = 'bathrooms'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for bathrooms:
# of intial rows: 19352
# after dropping rows: 19205

Outliers remaining: 
price                  0
bedrooms               0
bathrooms              0
sqft_living            0
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        152
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [72]:
drop_col = 'sqft_living15'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for sqft_living15:
# of intial rows: 19205
# after dropping rows: 19053

Outliers remaining: 
price                0
bedrooms             0
bathrooms            0
sqft_living          0
sqft_lot             0
floors               0
condition            0
grade                0
zipcode              0
sqft_living15        0
sqft_lot15           0
code_view            0
code_waterfront      0
is_reno              0
bins_yrbuilt         0
bins_sqftbasement    0
bins_sqftabove       0
dtype: int64

NORMALIZING UNITS (RobustScaler)

In [73]:
# ADDING OUTLIER REMOVAL FROM preprocessing.RobuseScaler
from sklearn.preprocessing import RobustScaler

robscaler = RobustScaler()
robscaler
Out[73]:
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)
In [74]:
vars_to_scale = ['sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms']

for col in vars_to_scale:
            
    col_data = np.array(np.array(df_norm[col]))
    res = robscaler.fit_transform(col_data.reshape(-1,1)) #,df['price'])
    df_norm['sca_'+col] = res.flatten()
In [75]:
# IF DROPPING VARS UNCOMMENT BELOW
# df_norm.drop(vars_to_scale,axis=1,inplace=True)
In [76]:
df_norm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 23 columns):
price                19053 non-null float64
bedrooms             19053 non-null float64
bathrooms            19053 non-null float64
sqft_living          19053 non-null float64
sqft_lot             19053 non-null int64
floors               19053 non-null float64
condition            19053 non-null int64
grade                19053 non-null int64
zipcode              19053 non-null category
sqft_living15        19053 non-null float64
sqft_lot15           19053 non-null int64
code_view            19053 non-null int8
code_waterfront      19053 non-null int8
is_reno              19053 non-null int64
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
sca_sqft_living      19053 non-null float64
sca_sqft_lot         19053 non-null float64
sca_sqft_living15    19053 non-null float64
sca_sqft_lot15       19053 non-null float64
sca_bedrooms         19053 non-null float64
sca_bathrooms        19053 non-null float64
dtypes: category(4), float64(12), int64(5), int8(2)
memory usage: 2.7 MB
In [77]:
df_norm.head()
Out[77]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade zipcode sqft_living15 ... is_reno bins_yrbuilt bins_sqftbasement bins_sqftabove sca_sqft_living sca_sqft_lot sca_sqft_living15 sca_sqft_lot15 sca_bedrooms sca_bathrooms
0 221900.0 3.0 1.00 1180.0 5650 1.0 3 7 98178 1340.0 ... 0 (1950.0, 1960.0] (-inf, 1.0] (1000.0, 1500.0] -0.663265 -0.355075 -0.569620 -0.391872 0.0 -1.00
1 538000.0 3.0 2.25 2570.0 7242 2.0 3 7 98125 1690.0 ... 1 (1950.0, 1960.0] (1.0, 501.0] (2000.0, 2500.0] 0.755102 -0.047858 -0.126582 0.026953 0.0 0.25
2 180000.0 2.0 1.00 770.0 10000 1.0 3 6 98028 2720.0 ... 0 (1930.0, 1940.0] (-inf, 1.0] (500.0, 1000.0] -1.081633 0.484369 1.177215 0.116024 -1.0 -1.00
3 604000.0 4.0 3.00 1960.0 5000 1.0 5 7 98136 1360.0 ... 0 (1960.0, 1970.0] (501.0, 1001.0] (1000.0, 1500.0] 0.132653 -0.480509 -0.544304 -0.528743 1.0 1.00
4 510000.0 3.0 2.00 1680.0 8080 1.0 3 8 98074 1800.0 ... 0 (1980.0, 1990.0] (-inf, 1.0] (1500.0, 2000.0] -0.153061 0.113856 0.012658 -0.001685 0.0 0.00

5 rows × 23 columns

Normalizing by log+z-score (OLD)

In [78]:
# # Normalize chosen vars
# # df_norm = pd.DataFrame({})
# df_norm['logz_price'] = log_z(df['price'])
# df_norm['logz_sqft_living'] = log_z(df['sqft_living'])a
# df_norm['logz_sqft_living15'] = log_z(df['sqft_living15'])
# df_norm['logz_sqft_lot'] = log_z(df['sqft_lot'])
# df_norm['logz_sqft_lot15'] = log_z(df['sqft_lot15'])
# df_norm.index = df.index

Normalizing to z-score + removing >3 STD (OLD)

In [79]:
# Z-SCORE AND REMOVE >3 STD 
# vars_dont_norm = ['yr_renovated','yr_built','floors','view','grade','condition']
# vars_to_norm = df.describe().drop(vars_dont_norm,axis=1).columns
# vars_to_rem_out = None

# col_names =vars_to_rem_out
# df_norm = df[vars_to_norm].copy()
# df_norm['id'] = df['id'].copy()
# df_norm.set_index('id')
# df_norm.info()
                  
# for col in col_names:
#       if col != 'id':
#         df_norm['zClean_'+col] = rem_out_z(df[col])
#     # df_norm  = df_norm.filter(regex=('log'),axis=1)
# df_norm.info()
In [80]:
# Drop na
# # df_run = pd.DataFrame({)
# temp_df = df.filter(regex=('bin'),axis=1)
# df_run = pd.concat([df_norm,temp_df],axis=1)#,sort=False)
# df_run.describe()
In [81]:
# # LOG TRANSFORM AND Z-SCORE DATA
# col_names = vars_to_norm
# df_norm = df[vars_to_norm].copy()
# # df_norm.info()
# for col in col_names:
#     data_to_log = df.loc[df[col]>0]
#     df_norm['logZ_'+col] = log_z(data_to_log[col])
# #     df_norm['logZ_'+col] = log_z(df[col])
# df_norm  = df_norm.filter(regex=('log'),axis=1)
# df_norm.info()

CHECKING NORMALIZED DATASET

In [144]:
plt.style.use('fivethirtyeight')
plot_hist_scat_sns(df_norm)
In [83]:
df_norm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 23 columns):
price                19053 non-null float64
bedrooms             19053 non-null float64
bathrooms            19053 non-null float64
sqft_living          19053 non-null float64
sqft_lot             19053 non-null int64
floors               19053 non-null float64
condition            19053 non-null int64
grade                19053 non-null int64
zipcode              19053 non-null category
sqft_living15        19053 non-null float64
sqft_lot15           19053 non-null int64
code_view            19053 non-null int8
code_waterfront      19053 non-null int8
is_reno              19053 non-null int64
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
sca_sqft_living      19053 non-null float64
sca_sqft_lot         19053 non-null float64
sca_sqft_living15    19053 non-null float64
sca_sqft_lot15       19053 non-null float64
sca_bedrooms         19053 non-null float64
sca_bathrooms        19053 non-null float64
dtypes: category(4), float64(12), int64(5), int8(2)
memory usage: 2.7 MB

Recheck multipol

In [84]:
multiplot(df_norm.filter(regex='sca',axis=1))
plt.title('Scaled Data only')
Out[84]:
Text(0.5, 1.0, 'Scaled Data only')
In [85]:
X =df_norm.loc[:,~(df_norm.columns.str.startswith('sca'))]
multiplot(X.drop('price',axis=1))
plt.title('Un-scaled Data')
X=[]

No multicollinearity to worry about. Huzzah!

In [86]:
# DEFINING DATASET TO RUN (df_run)
df_run = df_norm.copy()
# df_run.to_csv('df_run_pre_codes.csv')
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 23 columns):
price                19053 non-null float64
bedrooms             19053 non-null float64
bathrooms            19053 non-null float64
sqft_living          19053 non-null float64
sqft_lot             19053 non-null int64
floors               19053 non-null float64
condition            19053 non-null int64
grade                19053 non-null int64
zipcode              19053 non-null category
sqft_living15        19053 non-null float64
sqft_lot15           19053 non-null int64
code_view            19053 non-null int8
code_waterfront      19053 non-null int8
is_reno              19053 non-null int64
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
sca_sqft_living      19053 non-null float64
sca_sqft_lot         19053 non-null float64
sca_sqft_living15    19053 non-null float64
sca_sqft_lot15       19053 non-null float64
sca_bedrooms         19053 non-null float64
sca_bathrooms        19053 non-null float64
dtypes: category(4), float64(12), int64(5), int8(2)
memory usage: 2.7 MB

CAT.CODES FOR BINNED DATA

In [87]:
df_filt = pd.DataFrame({})

df_filt = df_run.filter(regex=('bins_'),axis =1).copy()
df_filt['zipcode']=df_run['zipcode'].copy()
df_filt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 4 columns):
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
zipcode              19053 non-null category
dtypes: category(4)
memory usage: 227.0 KB
In [88]:
# Creating binned vars cat codes
for col in df_filt:
    df_filt['code'+col] = df_filt[col].cat.codes
    df_filt.drop(col,axis=1,inplace=True)
df_filt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 4 columns):
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: int8(4)
memory usage: 223.3 KB
In [89]:
df_filt.head()
Out[89]:
codebins_yrbuilt codebins_sqftbasement codebins_sqftabove codezipcode
0 5 0 2 66
1 5 1 4 55
2 3 0 1 16
3 6 2 2 58
4 8 0 3 37

Concatenate final df for modeling (df_run)

In [90]:
## Select columns that do not contain the string 'logZ'
# df_run =df_run.loc[:,~(df_run.columns.str.startswith('logZ'))]
In [91]:
# # df_filt['price'] = df_run['price'].copy()
# df_run =df_run.loc[:,~(df_run.columns.str.startswith('bin'))]
# df_run.info()
In [92]:
# # DECIDE NOW IF WANT CLEAN OR NOT CLEANED DATA

# to_drop_unclean = ['bedrooms','bathrooms','sqft_living','sqft_lot','sqft_living15','sqft_lot15','zClean_price']
# to_drop_clean = list(df_run.loc[:,~(df_run.columns.str.startswith('zClean'))].columns)
# to_drop_unclean, to_drop_clean
In [93]:
# DROP UNCLEAN
# df_run.drop(['id','bedrooms','bathrooms','sqft_living','sqft_lot','sqft_living15','sqft_lot15','zClean_price'],axis=1,inplace=True)
# df_run.drop(to_drop_unclean,axis=1,inplace=True)
In [94]:
# Concatenate codebins from df_filt + df_run
df_run = pd.concat([df_run, df_filt],axis=1)  #).filter(regex=('code'))],axis=1)
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB
In [95]:
df_run.head(2)
Out[95]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade zipcode sqft_living15 ... sca_sqft_living sca_sqft_lot sca_sqft_living15 sca_sqft_lot15 sca_bedrooms sca_bathrooms codebins_yrbuilt codebins_sqftbasement codebins_sqftabove codezipcode
0 221900.0 3.0 1.00 1180.0 5650 1.0 3 7 98178 1340.0 ... -0.663265 -0.355075 -0.569620 -0.391872 0.0 -1.00 5 0 2 66
1 538000.0 3.0 2.25 2570.0 7242 2.0 3 7 98125 1690.0 ... 0.755102 -0.047858 -0.126582 0.026953 0.0 0.25 5 1 4 55

2 rows × 27 columns

Clever line of code to select columns by name

# Select columns that do not contain the string 'logZ'
df_run =df_run.loc[:,~(df_run.columns.str.startswith('logZ'))]

Saving/loading df_run after cleaning up

In [96]:
# bookmark 202am
# df_run.to_csv('df_run_coded.csv')
In [97]:
# df_run = pd.read_csv('df_run_coded.csv')
# df_run.info()

FITTING AN INTIAL MODEL:

In [98]:
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB

DETERMINING IDEAL FEATURES TO USE

  • Use MinMaxScaler to get on same scale
  • Use RFE to find the best features
  • Get ranking of feature importance (from both scaled and unscaled data)
In [99]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

import sklearn.metrics as metrics
In [100]:
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB
In [101]:
# Define selector function combining RFE and linear regression
linreg = LinearRegression()
selector = RFE(linreg, n_features_to_select=1)

# Drop already scaled variables for this feature testing
X =df_run.loc[:,~(df_run.columns.str.startswith(('bins','zip')))]
X = X.drop('price',axis=1)

# RUNNING RFE ON THE UNSCALED DATA(DEMONSTRATION)
Y = df_run['price']
# Y = df_run['logz_price']
# X = df_run.drop(['price'],axis=1)
In [102]:
# Checking X
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 22 columns):
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: float64(11), int64(5), int8(6)
memory usage: 2.6 MB
In [103]:
# Run regressions on X,Y 
selector = selector.fit(X,Y)

# Saving unscaled rankings for demo purposes
no_scale = selector.ranking_
In [104]:
# Scale all variables to value between 0-1 to use RFE to determine which features are the most important for determining price?
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Scale the data before running RFE
print('Consider revisiting this step and dummy-coding zipcode.')

# ONLY SCALE NON-CATEGORICAL, ONE-HOT CATEGORICAL
scaler.fit(X,Y)
scaled_data = scaler.transform(X)
scaled_data.shape
Consider revisiting this step and dummy-coding zipcode.
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\sklearn\preprocessing\data.py:323: DataConversionWarning: Data with input dtype int8, int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
Out[104]:
(19053, 22)
In [105]:
# Running RFE with scaled data
selector = selector.fit(scaled_data, Y) 
scaled = selector.ranking_
In [106]:
# Create a dataframe with the ranked values of each feature for both scaled and unscaled data
best_features = pd.DataFrame({'columns':X.columns, 'scaled_rank' : scaled,'unscaled_rank':no_scale})
best_features.set_index('columns',inplace=True)
In [107]:
# Display dataframe (sorted based on unscaled rank)
best_features.sort_values('unscaled_rank')
Out[107]:
scaled_rank unscaled_rank
columns
grade 1 1
code_view 8 2
codebins_sqftbasement 11 3
codebins_sqftabove 21 4
floors 9 5
codebins_yrbuilt 3 6
sca_bathrooms 18 7
is_reno 20 8
condition 13 9
sca_bedrooms 19 10
bathrooms 15 11
bedrooms 16 12
code_waterfront 22 13
codezipcode 17 14
sqft_living 2 15
sqft_living15 12 16
sqft_lot15 14 17
sqft_lot 5 18
sca_sqft_living15 7 19
sca_sqft_living 4 20
sca_sqft_lot15 6 21
sca_sqft_lot 10 22
In [108]:
# Plot the difference in feature importance between analyzing scaled and unscaled data. 
# For demonstration purposes.
features = pd.DataFrame({'Columns':X.columns, 'Not_Scaled':no_scale, 'Scaled':scaled})
# PLot the difference between 
plt.style.use('seaborn')
# plt.xkcd()
ax = features.set_index('Columns').plot(kind = 'bar',figsize=(12,8))
ax.set_title('Feature Importance Scaled vs. Not Scaled')
ax.set_ylabel('Features Importance')
Out[108]:
Text(0, 0.5, 'Features Importance')

Using elbow plots to identify the best # of features to use

In [109]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 22 columns):
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: float64(11), int64(5), int8(6)
memory usage: 2.6 MB

Plot R_squared and MSE for Scaled Data

In [110]:
r_squared = []
for x in range(1, len(X.columns)):
    selector = RFE(linreg, n_features_to_select=x)
    selector.fit(scaled_data, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    r_sq = linreg.score(X[X.columns[selector.support_]], Y)
    r_squared.append(r_sq)
In [111]:
# r_squared
In [112]:
from sklearn.metrics import mean_squared_error

mse=[]
for x in range(1, len(X.columns)):
    selector = RFE(linreg,  n_features_to_select=x)
    selector.fit(scaled_data, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    y_pred = linreg.predict(X[X.columns[selector.support_]])
    mse.append(mean_squared_error(Y, y_pred))
# mse
In [113]:
from sklearn.metrics import mean_squared_error

fig = plt.figure(figsize=(12 ,6))

ax1 = plt.subplot(121)
ax1.plot(range(1, len(X.columns)), r_squared)
ax1.set_ylabel('R_Squared')
ax1.set_xlabel('Number of Features')
ax1.grid()

ax2 = plt.subplot(122)
ax2.plot(range(1,len(mse)+1), mse )
ax2.set_ylabel('MSE')

ax2.set_xlabel('Number of Features',fontsize=20)
ax2.grid()

Plot R_squared and MSE for Unscaled Data

In [114]:
r_squared = []
for x in range(1, len(X.columns)):
    selector = RFE(linreg, n_features_to_select=x)
    selector.fit(X, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    r_sq = linreg.score(X[X.columns[selector.support_]], Y)
    r_squared.append(r_sq)

    
from sklearn.metrics import mean_squared_error

mse=[]
for x in range(1, len(X.columns)):
    selector = RFE(linreg,  n_features_to_select=x)
    selector.fit(X, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    y_pred = linreg.predict(X[X.columns[selector.support_]])
    mse.append(mean_squared_error(Y, y_pred))



fig = plt.figure(figsize=(12 ,6))

ax1 = plt.subplot(121)
ax1.plot(range(1, len(X.columns)), r_squared)
ax1.set_ylabel('R_Squared')
ax1.set_xlabel('Number of Features')
ax1.grid()

ax2 = plt.subplot(122)
ax2.plot(range(1,len(mse)+1), mse )
ax2.set_ylabel('MSE')

ax2.set_xlabel('Number of Features')
ax2.grid() 

Choosing Features Based on Rankings

  • The above figure is indicating that right now my best possible R2 with the lowest # of features would be with 6 predictors (judging unscaled data).
  • Now examine the sorted best_features dataframe to see which 6 to use
In [115]:
best_features.sort_values('unscaled_rank')
Out[115]:
scaled_rank unscaled_rank
columns
grade 1 1
code_view 8 2
codebins_sqftbasement 11 3
codebins_sqftabove 21 4
floors 9 5
codebins_yrbuilt 3 6
sca_bathrooms 18 7
is_reno 20 8
condition 13 9
sca_bedrooms 19 10
bathrooms 15 11
bedrooms 16 12
code_waterfront 22 13
codezipcode 17 14
sqft_living 2 15
sqft_living15 12 16
sqft_lot15 14 17
sqft_lot 5 18
sca_sqft_living15 7 19
sca_sqft_living 4 20
sca_sqft_lot15 6 21
sca_sqft_lot 10 22
In [116]:
# Pull out the 6 best features via ranking 
best_num_feat = 6
selected_features = best_features.sort_values('unscaled_rank')[0:best_num_feat]
selected_features
Out[116]:
scaled_rank unscaled_rank
columns
grade 1 1
code_view 8 2
codebins_sqftbasement 11 3
codebins_sqftabove 21 4
floors 9 5
codebins_yrbuilt 3 6
In [117]:
# USING UNSCALED
selected_features.index
Out[117]:
Index(['grade', 'code_view', 'codebins_sqftbasement', 'codebins_sqftabove',
       'floors', 'codebins_yrbuilt'],
      dtype='object', name='columns')
In [118]:
# Check the columns of X
X[selected_features.index].head()
Out[118]:
grade code_view codebins_sqftbasement codebins_sqftabove floors codebins_yrbuilt
0 7 0 0 2 1.0 5
1 7 0 1 4 2.0 5
2 6 0 0 1 1.0 3
3 7 0 2 2 1.0 6
4 8 0 0 3 1.0 8

PRELIMINARY UNIVARIATE LINEAR REGRESSION MODELING

In [119]:
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB
In [120]:
# Running simple linear regression for each predictor on its own
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import statsmodels.stats.api as sms


# log_price = np.log(df['price'])
# df['log_price'] = log_price

target_var = 'price'
col_names = df_run.drop(['price'],axis=1).columns

# Create results list for saving the output statstics for each predictor
results = [['ind_var', 'r_squared', 'intercept', 'slope', 'p-value' ]] 

for idx, val in enumerate(col_names): 
    
    # Use the names of the columns to determine format of forumla  
    if val.startswith('code'):
        
        df_run[val] = df_run[val].astype('category').cat.as_ordered() 
        f =f'{str(target_var)}~C({val})'
        
    elif val.startswith('bin'):
        
        df_run[val] = df_run[val].cat.as_ordered() 
        f =f'{str(target_var)}~C({val})'
        
    else:
        
        f =f'{str(target_var)}~{val}'
        
    # Run the ols models     
    model = smf.ols(formula=f, data=df_run).fit() 
    model.summary()
    
    # Append results
    results.append([val, model.rsquared, model.params[0], model.params[1], model.pvalues[1] ]) 
    
# Turn results into dataframe with correct index and columns
res_df = pd.DataFrame(results)
res_df.columns = res_df.iloc[0]
res_df=res_df[1:]
res_df.set_index('ind_var',inplace=True)
res_df.sort_values('r_squared',ascending=False)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\regression\linear_model.py:1633: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\base\model.py:1100: RuntimeWarning: invalid value encountered in true_divide
  return self.params / self.bse
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\scipy\stats\_distn_infrastructure.py:877: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\scipy\stats\_distn_infrastructure.py:877: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\scipy\stats\_distn_infrastructure.py:1831: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
Out[120]:
r_squared intercept slope p-value
ind_var
codezipcode 0.498102 280365 -47520.3 0.0002275
zipcode 0.498102 280365 -47520.3 0.0002275
grade 0.362959 -454845 123114 0
sqft_living 0.347657 142997 168.572 0
sca_sqft_living 0.347657 451484 165200 0
sqft_living15 0.27516 118553 184.751 0
sca_sqft_living15 0.27516 449258 145953 0
codebins_sqftabove 0.225333 210050 141861 0.107357
bins_sqftabove 0.225333 210050 141861 0.107357
bathrooms 0.169802 215907 125027 0
sca_bathrooms 0.169802 465961 125027 0
sca_bedrooms 0.0717169 447998 68611.9 0
bedrooms 0.0717169 242163 68611.9 0
floors 0.0629911 330868 93475.5 1.61001e-271
code_view 0.0563913 455177 168640 1.60061e-43
bins_sqftbasement 0.0492165 442306 19843.1 3.81601e-06
codebins_sqftbasement 0.0492165 442306 19843.1 3.81601e-06
codebins_yrbuilt 0.0482952 543106 -35754.2 0.000899441
bins_yrbuilt 0.0482952 543106 -35754.2 0.000899441
is_reno 0.00765121 465065 103741 1.10517e-33
sca_sqft_lot 0.00581721 465533 2035.97 5.50156e-26
sqft_lot 0.00581721 462590 0.392892 5.50156e-26
sqft_lot15 0.00411256 461884 0.518291 7.96816e-19
sca_sqft_lot15 0.00411256 465777 2461.37 7.96816e-19
code_waterfront 0.00239512 467280 203173 1.64907e-11
condition 0.00235454 416978 14985.1 2.065e-11
In [121]:
# Initial variables for modeling
try_modeling = ['codezipcode', 'grade','sca_sqft_living', 'sca_sqft_living15']
# Hmm...realized there are redundant versions of variables and am just selecting the correct versions to use.

NOTES FOLLOWING PRELIMINARY LINEAR REGRESSIONS

  • Variables that had high R_square with logz_price:
    • New results = ['codezipcode', 'grade','sca_sqft_living', 'sca_sqft_living15']
try_modeling = try_modeling = ['codezipcode', 'grade','sca_sqft_living', 'sca_sqft_living16']

MULTIVARIATE REGRESSIONS

In [122]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import statsmodels.stats.api as sms
from sklearn.preprocessing import MinMaxScaler

# # PUTTING TOGETHER THE PREDICTORS TO RUN IN THE REGRESSION
# ## Last min dummy vars []'cat_grade','cat_zipcode','cat_view','cat_bins_sqft_above','cat_bins_sqft_basement']
# dum_grades = pd.get_dummies(df_run['cat_grade'],prefix='gr').iloc[:,:-1]
# dum_view = pd.get_dummies(df_run['cat_view'], prefix='view').iloc[:,:-1]
# dum_sqft_above = pd.get_dummies(df_run['cat_bins_sqftabove'],prefix='sqftAb').iloc[:,:-1]
# dum_sqft_base = pd.get_dummies(df_run['cat_bins_sqftbasement'],prefix='sqftBa').iloc[:,:-1]


# RUNNING K-FOLD VALIDATION WITH STATSMODELS OLS.
# X = df_run.drop(['price','logZ_price'],axis=1)
# list_predictors = ['logZ_sqft_living','logZ_sqft_living15','bedrooms','bathrooms','floors']
# scaler = MinMaxScaler()
# sel_columns = selected_features.index
In [123]:
# Define X, Y 
X = df_run[try_modeling]
# X.columns

Y = df_run['price']
# y = df_run['logZ_price']

# Get a list of predictor names string 
list_predictors = [str(x) for x in X.columns]
list_predictors.append('intercept')
list_predictors
Out[123]:
['codezipcode', 'grade', 'sca_sqft_living', 'sca_sqft_living15', 'intercept']
In [124]:
# Comcatenate X,Y for OLS
df_run_ols = pd.concat([Y,X],axis=1)

# Import packages
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import scipy.stats as stats

# Enter equation for selected predictors: (use C to run as categorical) 
# f1 = 'price~C(codezipcode)+C(grade)+sca_sqft_living+sca_sqft_living15' # 0.8 r1 Adjusted
f1 = 'price~C(codezipcode)+grade+sca_sqft_living+sca_sqft_living15' 

# Run model and report sumamry
model = smf.ols(formula=f1, data=df_run_ols).fit()
model.summary()
Out[124]:
OLS Regression Results
Dep. Variable: price R-squared: 0.797
Model: OLS Adj. R-squared: 0.796
Method: Least Squares F-statistic: 1035.
Date: Sun, 17 Mar 2019 Prob (F-statistic): 0.00
Time: 22:11:13 Log-Likelihood: -2.4441e+05
No. Observations: 19053 AIC: 4.890e+05
Df Residuals: 18980 BIC: 4.895e+05
Df Model: 72
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5152.6368 8731.030 0.590 0.555 -1.2e+04 2.23e+04
C(codezipcode)[T.1] 2.361e+04 8215.662 2.873 0.004 7503.139 3.97e+04
C(codezipcode)[T.2] 2341.0236 7378.061 0.317 0.751 -1.21e+04 1.68e+04
C(codezipcode)[T.3] 5.114e+05 8885.250 57.560 0.000 4.94e+05 5.29e+05
C(codezipcode)[T.4] 3.405e+05 9150.155 37.213 0.000 3.23e+05 3.58e+05
C(codezipcode)[T.5] 2.802e+05 6825.525 41.045 0.000 2.67e+05 2.94e+05
C(codezipcode)[T.6] 2.573e+05 9464.329 27.187 0.000 2.39e+05 2.76e+05
C(codezipcode)[T.7] 2.557e+05 7457.922 34.291 0.000 2.41e+05 2.7e+05
C(codezipcode)[T.8] 1.054e+05 1.05e+04 10.074 0.000 8.49e+04 1.26e+05
C(codezipcode)[T.9] 1.348e+05 8258.863 16.319 0.000 1.19e+05 1.51e+05
C(codezipcode)[T.10] 1.282e+05 9852.210 13.014 0.000 1.09e+05 1.48e+05
C(codezipcode)[T.11] 8.965e+04 8371.036 10.710 0.000 7.32e+04 1.06e+05
C(codezipcode)[T.12] 4.869e+04 7811.281 6.234 0.000 3.34e+04 6.4e+04
C(codezipcode)[T.13] -2.103e+04 6398.367 -3.287 0.001 -3.36e+04 -8487.932
C(codezipcode)[T.14] 1.723e+05 1.21e+04 14.294 0.000 1.49e+05 1.96e+05
C(codezipcode)[T.15] 1.893e+05 6900.034 27.438 0.000 1.76e+05 2.03e+05
C(codezipcode)[T.16] 1.282e+05 7366.804 17.401 0.000 1.14e+05 1.43e+05
C(codezipcode)[T.17] 2.198e+05 7223.982 30.430 0.000 2.06e+05 2.34e+05
C(codezipcode)[T.18] 907.7456 7534.878 0.120 0.904 -1.39e+04 1.57e+04
C(codezipcode)[T.19] 1.147e+04 7391.085 1.552 0.121 -3015.547 2.6e+04
C(codezipcode)[T.20] 3234.7083 9600.539 0.337 0.736 -1.56e+04 2.21e+04
C(codezipcode)[T.21] 3.27e+05 6873.901 47.578 0.000 3.14e+05 3.41e+05
C(codezipcode)[T.22] 1.858e+05 6322.193 29.392 0.000 1.73e+05 1.98e+05
C(codezipcode)[T.23] 3.341e+04 6189.647 5.398 0.000 2.13e+04 4.55e+04
C(codezipcode)[T.24] 6.446e+05 3.72e+04 17.333 0.000 5.72e+05 7.17e+05
C(codezipcode)[T.25] 4.396e+05 8998.802 48.846 0.000 4.22e+05 4.57e+05
C(codezipcode)[T.26] 1.342e+04 6249.489 2.147 0.032 1166.174 2.57e+04
C(codezipcode)[T.27] 1.194e+05 7939.547 15.036 0.000 1.04e+05 1.35e+05
C(codezipcode)[T.28] 2.492e+05 6292.625 39.602 0.000 2.37e+05 2.62e+05
C(codezipcode)[T.29] 2.347e+05 6937.430 33.827 0.000 2.21e+05 2.48e+05
C(codezipcode)[T.30] 4.06e+04 7516.748 5.402 0.000 2.59e+04 5.53e+04
C(codezipcode)[T.31] 1.093e+05 6693.020 16.330 0.000 9.62e+04 1.22e+05
C(codezipcode)[T.32] 3.932e+04 6534.860 6.018 0.000 2.65e+04 5.21e+04
C(codezipcode)[T.33] 9.51e+04 6589.751 14.431 0.000 8.22e+04 1.08e+05
C(codezipcode)[T.34] 1.201e+05 7412.001 16.199 0.000 1.06e+05 1.35e+05
C(codezipcode)[T.35] 1.975e+05 9885.649 19.982 0.000 1.78e+05 2.17e+05
C(codezipcode)[T.36] 1.742e+05 7612.251 22.880 0.000 1.59e+05 1.89e+05
C(codezipcode)[T.37] 2.137e+05 6770.358 31.559 0.000 2e+05 2.27e+05
C(codezipcode)[T.38] 2.286e+05 7438.126 30.732 0.000 2.14e+05 2.43e+05
C(codezipcode)[T.39] 1.781e+05 8684.281 20.503 0.000 1.61e+05 1.95e+05
C(codezipcode)[T.40] -1.148e+04 6965.141 -1.648 0.099 -2.51e+04 2173.340
C(codezipcode)[T.41] 4.05e+05 1.11e+04 36.408 0.000 3.83e+05 4.27e+05
C(codezipcode)[T.42] 3.297e+05 6239.985 52.830 0.000 3.17e+05 3.42e+05
C(codezipcode)[T.43] 3.908e+05 8585.748 45.522 0.000 3.74e+05 4.08e+05
C(codezipcode)[T.44] 1.203e+05 7095.802 16.956 0.000 1.06e+05 1.34e+05
C(codezipcode)[T.45] 3.252e+05 7546.464 43.096 0.000 3.1e+05 3.4e+05
C(codezipcode)[T.46] 1.193e+05 8417.753 14.169 0.000 1.03e+05 1.36e+05
C(codezipcode)[T.47] 4.236e+05 1.13e+04 37.566 0.000 4.01e+05 4.46e+05
C(codezipcode)[T.48] 4.364e+05 8724.465 50.019 0.000 4.19e+05 4.53e+05
C(codezipcode)[T.49] 3.271e+05 6266.348 52.207 0.000 3.15e+05 3.39e+05
C(codezipcode)[T.50] 3.16e+05 7192.406 43.936 0.000 3.02e+05 3.3e+05
C(codezipcode)[T.51] 3.237e+05 6318.347 51.230 0.000 3.11e+05 3.36e+05
C(codezipcode)[T.52] 1.744e+05 6436.099 27.102 0.000 1.62e+05 1.87e+05
C(codezipcode)[T.53] 4.177e+05 9094.361 45.931 0.000 4e+05 4.36e+05
C(codezipcode)[T.54] 3.163e+05 7509.896 42.116 0.000 3.02e+05 3.31e+05
C(codezipcode)[T.55] 2.019e+05 6737.030 29.975 0.000 1.89e+05 2.15e+05
C(codezipcode)[T.56] 2.127e+05 6999.314 30.390 0.000 1.99e+05 2.26e+05
C(codezipcode)[T.57] 1.563e+05 6413.853 24.371 0.000 1.44e+05 1.69e+05
C(codezipcode)[T.58] 2.722e+05 7617.985 35.728 0.000 2.57e+05 2.87e+05
C(codezipcode)[T.59] 2.5e+05 7152.431 34.956 0.000 2.36e+05 2.64e+05
C(codezipcode)[T.60] 1.264e+05 7419.402 17.037 0.000 1.12e+05 1.41e+05
C(codezipcode)[T.61] 5.933e+04 1.32e+04 4.489 0.000 3.34e+04 8.52e+04
C(codezipcode)[T.62] 1.486e+05 6562.321 22.643 0.000 1.36e+05 1.61e+05
C(codezipcode)[T.63] 1.299e+05 7685.234 16.905 0.000 1.15e+05 1.45e+05
C(codezipcode)[T.64] 5.063e+04 7497.228 6.753 0.000 3.59e+04 6.53e+04
C(codezipcode)[T.65] 2.276e+05 7840.903 29.027 0.000 2.12e+05 2.43e+05
C(codezipcode)[T.66] 7.063e+04 7516.470 9.396 0.000 5.59e+04 8.54e+04
C(codezipcode)[T.67] 3.793e+04 9456.808 4.011 0.000 1.94e+04 5.65e+04
C(codezipcode)[T.68] 4.591e+04 7416.408 6.191 0.000 3.14e+04 6.05e+04
C(codezipcode)[T.69] 3.666e+05 7441.021 49.269 0.000 3.52e+05 3.81e+05
grade 3.645e+04 1004.395 36.289 0.000 3.45e+04 3.84e+04
sca_sqft_living 1.058e+05 1483.602 71.280 0.000 1.03e+05 1.09e+05
sca_sqft_living15 3.646e+04 1505.741 24.213 0.000 3.35e+04 3.94e+04
Omnibus: 2745.646 Durbin-Watson: 2.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8873.079
Skew: 0.739 Prob(JB): 0.00
Kurtosis: 5.998 Cond. No. 491.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Cross-Validation with K-Fold Test-Train Splits:

- f1 = 'price ~ C(codezipcode) + grade + sca_sqft_living + sca_sqft_living15'
  • price ~ zipcode(category), grade(category), sqft_living(scaled to median, RobustScaler) + sqft_living15 (scaled to median, RobustScaler)

Save df_run_ols to csv

In [125]:
df_run_ols.to_csv('df_run_ols_model.csv')
In [126]:
# Visualize Q-Q Plots
resid1=model.resid
fig = sm.graphics.qqplot(resid1, dist=stats.norm, line='45', fit=True,marker='.')
  • Interpreting the Q-Q plot:
    • The Q-Q plot looks a bit crazy and may indicate... outliers?
    • The only thing I did not check for outliers in final model was zipcode.
    • Will run cross-validation with test-train-split to help decide
In [127]:
# Visualizing final dataset again.
import matplotlib.pyplot as plt

# Re-inspecting XY
# plot_hist_scat(df_run_ols)

FINAL REGRESSION RESULTS

K-Fold valiation with OLS

In [128]:
# k_fold_val_ols(X,y,k=10):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics 

y = df_run['price']


# Run 10-fold cross validation
results = [['set#','R_square_train','MSE_train','R_square_test','MSE_test']]

num_coeff = X.shape[1]

list_predictors = [str(x) for x in X.columns]
list_predictors.append('intercept') 

reg_params = [list_predictors]

i=0
k=10
while i <(k+1):
    X_train, X_test, y_train, y_test = train_test_split(X,y) #,stratify=[cat_col_names])

    data = pd.concat([X_train,y_train],axis=1)
    f1 = 'price~C(codezipcode)+grade+sca_sqft_living+sca_sqft_living15' 
    model = smf.ols(formula=f1, data=data).fit()
    model.summary()
    
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_residuals = y_hat_train - y_train
    test_residuals = y_hat_test - y_test

        
    train_mse = metrics.mean_squared_error(y_train, y_hat_train)
    test_mse = metrics.mean_squared_error(y_test, y_hat_test)

    R_sqare_train = metrics.r2_score(y_train,y_hat_train)
    R_square_test = metrics.r2_score(y_test,y_hat_test)

    results.append([i,R_sqare_train,train_mse,R_square_test,test_mse])
    i+=1

    
results = pd.DataFrame(results[1:],columns=results[0])
results
model.summary()
Out[128]:
OLS Regression Results
Dep. Variable: price R-squared: 0.798
Model: OLS Adj. R-squared: 0.797
Method: Least Squares F-statistic: 780.4
Date: Sun, 17 Mar 2019 Prob (F-statistic): 0.00
Time: 22:11:16 Log-Likelihood: -1.8327e+05
No. Observations: 14289 AIC: 3.667e+05
Df Residuals: 14216 BIC: 3.672e+05
Df Model: 72
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 8154.9226 1.01e+04 0.809 0.418 -1.16e+04 2.79e+04
C(codezipcode)[T.1] 2.702e+04 9562.260 2.826 0.005 8280.463 4.58e+04
C(codezipcode)[T.2] 840.9742 8647.467 0.097 0.923 -1.61e+04 1.78e+04
C(codezipcode)[T.3] 5.038e+05 1.06e+04 47.644 0.000 4.83e+05 5.25e+05
C(codezipcode)[T.4] 3.472e+05 1.04e+04 33.365 0.000 3.27e+05 3.68e+05
C(codezipcode)[T.5] 2.89e+05 7926.799 36.455 0.000 2.73e+05 3.05e+05
C(codezipcode)[T.6] 2.669e+05 1.12e+04 23.783 0.000 2.45e+05 2.89e+05
C(codezipcode)[T.7] 2.567e+05 8501.575 30.197 0.000 2.4e+05 2.73e+05
C(codezipcode)[T.8] 1.093e+05 1.21e+04 9.025 0.000 8.56e+04 1.33e+05
C(codezipcode)[T.9] 1.352e+05 9458.337 14.294 0.000 1.17e+05 1.54e+05
C(codezipcode)[T.10] 1.387e+05 1.13e+04 12.264 0.000 1.16e+05 1.61e+05
C(codezipcode)[T.11] 8.826e+04 9486.000 9.304 0.000 6.97e+04 1.07e+05
C(codezipcode)[T.12] 4.797e+04 9073.223 5.287 0.000 3.02e+04 6.58e+04
C(codezipcode)[T.13] -1.881e+04 7467.592 -2.518 0.012 -3.34e+04 -4167.666
C(codezipcode)[T.14] 1.713e+05 1.35e+04 12.670 0.000 1.45e+05 1.98e+05
C(codezipcode)[T.15] 1.884e+05 8114.779 23.216 0.000 1.72e+05 2.04e+05
C(codezipcode)[T.16] 1.294e+05 8591.602 15.066 0.000 1.13e+05 1.46e+05
C(codezipcode)[T.17] 2.223e+05 8387.995 26.507 0.000 2.06e+05 2.39e+05
C(codezipcode)[T.18] 2183.3877 8605.727 0.254 0.800 -1.47e+04 1.91e+04
C(codezipcode)[T.19] 1.061e+04 8671.263 1.224 0.221 -6386.319 2.76e+04
C(codezipcode)[T.20] 1161.9994 1.12e+04 0.104 0.917 -2.08e+04 2.31e+04
C(codezipcode)[T.21] 3.318e+05 7953.586 41.716 0.000 3.16e+05 3.47e+05
C(codezipcode)[T.22] 1.89e+05 7374.039 25.624 0.000 1.74e+05 2.03e+05
C(codezipcode)[T.23] 3.641e+04 7184.038 5.068 0.000 2.23e+04 5.05e+04
C(codezipcode)[T.24] 6.567e+05 5.24e+04 12.535 0.000 5.54e+05 7.59e+05
C(codezipcode)[T.25] 4.301e+05 1.07e+04 40.384 0.000 4.09e+05 4.51e+05
C(codezipcode)[T.26] 1.194e+04 7307.381 1.634 0.102 -2381.266 2.63e+04
C(codezipcode)[T.27] 1.211e+05 9199.655 13.165 0.000 1.03e+05 1.39e+05
C(codezipcode)[T.28] 2.513e+05 7307.829 34.393 0.000 2.37e+05 2.66e+05
C(codezipcode)[T.29] 2.397e+05 8066.849 29.715 0.000 2.24e+05 2.56e+05
C(codezipcode)[T.30] 4.195e+04 8605.994 4.874 0.000 2.51e+04 5.88e+04
C(codezipcode)[T.31] 1.103e+05 7678.803 14.369 0.000 9.53e+04 1.25e+05
C(codezipcode)[T.32] 4.077e+04 7644.794 5.333 0.000 2.58e+04 5.58e+04
C(codezipcode)[T.33] 9.456e+04 7654.943 12.353 0.000 7.96e+04 1.1e+05
C(codezipcode)[T.34] 1.198e+05 8552.858 14.006 0.000 1.03e+05 1.37e+05
C(codezipcode)[T.35] 1.943e+05 1.17e+04 16.558 0.000 1.71e+05 2.17e+05
C(codezipcode)[T.36] 1.726e+05 8775.554 19.665 0.000 1.55e+05 1.9e+05
C(codezipcode)[T.37] 2.134e+05 7844.208 27.210 0.000 1.98e+05 2.29e+05
C(codezipcode)[T.38] 2.298e+05 8545.176 26.894 0.000 2.13e+05 2.47e+05
C(codezipcode)[T.39] 1.821e+05 9913.970 18.371 0.000 1.63e+05 2.02e+05
C(codezipcode)[T.40] -4753.5296 8105.070 -0.586 0.558 -2.06e+04 1.11e+04
C(codezipcode)[T.41] 4.042e+05 1.32e+04 30.697 0.000 3.78e+05 4.3e+05
C(codezipcode)[T.42] 3.36e+05 7244.615 46.379 0.000 3.22e+05 3.5e+05
C(codezipcode)[T.43] 3.971e+05 9917.664 40.041 0.000 3.78e+05 4.17e+05
C(codezipcode)[T.44] 1.192e+05 8143.686 14.637 0.000 1.03e+05 1.35e+05
C(codezipcode)[T.45] 3.243e+05 8771.026 36.979 0.000 3.07e+05 3.42e+05
C(codezipcode)[T.46] 1.214e+05 9864.227 12.309 0.000 1.02e+05 1.41e+05
C(codezipcode)[T.47] 4.225e+05 1.31e+04 32.351 0.000 3.97e+05 4.48e+05
C(codezipcode)[T.48] 4.346e+05 1.04e+04 41.861 0.000 4.14e+05 4.55e+05
C(codezipcode)[T.49] 3.246e+05 7245.275 44.800 0.000 3.1e+05 3.39e+05
C(codezipcode)[T.50] 3.197e+05 8368.082 38.208 0.000 3.03e+05 3.36e+05
C(codezipcode)[T.51] 3.206e+05 7265.612 44.124 0.000 3.06e+05 3.35e+05
C(codezipcode)[T.52] 1.817e+05 7440.697 24.414 0.000 1.67e+05 1.96e+05
C(codezipcode)[T.53] 4.262e+05 1.08e+04 39.290 0.000 4.05e+05 4.47e+05
C(codezipcode)[T.54] 3.208e+05 8622.592 37.201 0.000 3.04e+05 3.38e+05
C(codezipcode)[T.55] 2.048e+05 7744.862 26.448 0.000 1.9e+05 2.2e+05
C(codezipcode)[T.56] 2.194e+05 8158.930 26.893 0.000 2.03e+05 2.35e+05
C(codezipcode)[T.57] 1.572e+05 7446.852 21.105 0.000 1.43e+05 1.72e+05
C(codezipcode)[T.58] 2.662e+05 8881.610 29.971 0.000 2.49e+05 2.84e+05
C(codezipcode)[T.59] 2.526e+05 8194.731 30.823 0.000 2.37e+05 2.69e+05
C(codezipcode)[T.60] 1.273e+05 8510.393 14.953 0.000 1.11e+05 1.44e+05
C(codezipcode)[T.61] 6.024e+04 1.5e+04 4.009 0.000 3.08e+04 8.97e+04
C(codezipcode)[T.62] 1.507e+05 7642.570 19.713 0.000 1.36e+05 1.66e+05
C(codezipcode)[T.63] 1.257e+05 8943.098 14.060 0.000 1.08e+05 1.43e+05
C(codezipcode)[T.64] 5.018e+04 8644.492 5.805 0.000 3.32e+04 6.71e+04
C(codezipcode)[T.65] 2.334e+05 9049.067 25.788 0.000 2.16e+05 2.51e+05
C(codezipcode)[T.66] 7.103e+04 8827.970 8.046 0.000 5.37e+04 8.83e+04
C(codezipcode)[T.67] 3.886e+04 1.11e+04 3.495 0.000 1.71e+04 6.06e+04
C(codezipcode)[T.68] 4.401e+04 8430.032 5.220 0.000 2.75e+04 6.05e+04
C(codezipcode)[T.69] 3.616e+05 8719.498 41.473 0.000 3.45e+05 3.79e+05
grade 3.585e+04 1156.270 31.001 0.000 3.36e+04 3.81e+04
sca_sqft_living 1.052e+05 1713.520 61.385 0.000 1.02e+05 1.09e+05
sca_sqft_living15 3.822e+04 1745.215 21.902 0.000 3.48e+04 4.16e+04
Omnibus: 2089.717 Durbin-Watson: 2.006
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6994.413
Skew: 0.739 Prob(JB): 0.00
Kurtosis: 6.093 Cond. No. 542.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [129]:
results
Out[129]:
set# R_square_train MSE_train R_square_test MSE_test
0 0 0.800082 8.039554e+09 0.786466 8.393782e+09
1 1 0.797780 8.106427e+09 0.793682 8.188346e+09
2 2 0.796430 8.126212e+09 0.797342 8.142686e+09
3 3 0.796651 8.172856e+09 0.797048 7.990854e+09
4 4 0.798724 8.059876e+09 0.790235 8.349535e+09
5 5 0.795339 8.209724e+09 0.800561 7.898741e+09
6 6 0.796121 8.051210e+09 0.798113 8.372649e+09
7 7 0.798503 8.003434e+09 0.791384 8.507967e+09
8 8 0.797038 8.085759e+09 0.795678 8.261575e+09
9 9 0.798469 8.074819e+09 0.791620 8.283129e+09
10 10 0.798077 8.087727e+09 0.792365 8.262448e+09

Q-Q Plots

In [130]:
resid1=model.resid
fig = sm.graphics.qqplot(resid1, dist=stats.norm, line='45', fit=True,marker='.')
In [131]:
df_model = pd.concat([df_run[try_modeling],df_run['price']],axis=1)
# df_model.to_csv('df_housing_model.csv')
# df_run.to_csv('df_house_all_final_data.csv')
In [132]:
results.describe()
Out[132]:
set# R_square_train MSE_train R_square_test MSE_test
count 11.000000 11.000000 1.100000e+01 11.000000 1.100000e+01
mean 5.000000 0.797565 8.092509e+09 0.794045 8.241065e+09
std 3.316625 0.001383 5.951058e+07 0.004116 1.784755e+08
min 0.000000 0.795339 8.003434e+09 0.786466 7.898741e+09
25% 2.500000 0.796540 8.055543e+09 0.791502 8.165516e+09
50% 5.000000 0.797780 8.085759e+09 0.793682 8.262448e+09
75% 7.500000 0.798486 8.116320e+09 0.797195 8.361092e+09
max 10.000000 0.800082 8.209724e+09 0.800561 8.507967e+09

FINAL MODEL - New

  • For k=10 fold validation, with price as target variable:

    • mean r_squared for the test sets was 0.797, with mean MSE = 8.158691e+09

    f1 = 'price~C(codezipcode)+C(grade)+sca_sqft_living+sca_sqft_living15'

  • Predictors in final model:

    • 'Zipcode'
    • 'grade'
    • 'sqft_living'
    • 'sqfr_living15'
  • My final model indicates that the size, location, and housing geade to be critical components in determining salesprice.

Predictor Coefficients & Their Affect On Sales Price

  • Grade, sqft_living, and sqft_living15 all have straight-forward relationships with sales price, with positive coefficients.
    • sqft_living is a larger component of the price (coefficient: 1.043e+05)
    • grade (coeff: 3.679e+04)and sqft_living15 (coeff: 3.767e+04) have a similar magnitude of an effect on sales price
  • Zipcode is a bit trickier, as each zipcode has its own coefficient.
    • Overall, zipcode has a positive coefficient/effect on price (to varying degrees)
    • There are, however, a couple zipcodes that negatively impact sales price.

Future Directions

  • With more time I would have proceeded to perform the following steps:
    • Additional changes to the predictors in the model
      • Trying to remove outliers from zipcodes
    • Explored additional transformations to the data.
      • I was trying to be conservative to keep the interpretability of my model intact.
      • Log-transforming the data improved the distributions but made it more difficult to interpret.
    • I would further tweak the quality of the visuals, particularly changing the x-tick labels and rotation.
In [133]:
df_dropped.index
Out[133]:
RangeIndex(start=0, stop=21597, step=1)
In [134]:
df_run.describe()
Out[134]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 is_reno sca_sqft_living sca_sqft_lot sca_sqft_living15 sca_sqft_lot15 sca_bedrooms sca_bathrooms
count 1.905300e+04 19053.000000 19053.000000 19053.000000 1.905300e+04 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000
mean 4.681033e+05 3.293025 2.017136 1928.589513 1.403172e+04 1.468142 3.411746 7.496720 1892.003464 12000.351703 0.029287 0.100602 1.262394 0.129118 0.945326 0.293025 0.017136
std 1.999766e+05 0.780532 0.659094 699.470303 3.882067e+04 0.536934 0.647548 0.978593 567.785762 24743.514984 0.168613 0.713745 7.491446 0.718716 5.210258 0.780532 0.659094
min 7.800000e+04 2.000000 0.750000 440.000000 5.200000e+02 1.000000 1.000000 4.000000 399.000000 651.000000 0.000000 -1.418367 -1.345041 -1.760759 -1.444515 -1.000000 -1.250000
25% 3.135000e+05 3.000000 1.500000 1400.000000 5.000000e+03 1.000000 3.000000 7.000000 1460.000000 5026.000000 0.000000 -0.438776 -0.480509 -0.417722 -0.523268 0.000000 -0.500000
50% 4.310000e+05 3.000000 2.000000 1830.000000 7.490000e+03 1.000000 3.000000 7.000000 1790.000000 7511.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 5.890000e+05 4.000000 2.500000 2380.000000 1.018200e+04 2.000000 4.000000 8.000000 2250.000000 9775.000000 0.000000 0.561224 0.519491 0.582278 0.476732 1.000000 0.500000
max 1.120000e+06 5.000000 3.500000 4230.000000 1.651359e+06 3.500000 5.000000 11.000000 3660.000000 560617.000000 1.000000 2.448980 317.226746 2.367089 116.467888 2.000000 1.500000
In [135]:
import pandas as pd

df_final_data=pd.concat([df_run, df_dropped[['lat','long','id']]],axis=1)
df_final_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21597 entries, 0 to 21596
Data columns (total 30 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null float64
floors                   19053 non-null float64
condition                19053 non-null float64
grade                    19053 non-null float64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null float64
code_view                19053 non-null category
code_waterfront          19053 non-null category
is_reno                  19053 non-null float64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null category
codebins_sqftbasement    19053 non-null category
codebins_sqftabove       19053 non-null category
codezipcode              19053 non-null category
lat                      21597 non-null float64
long                     21597 non-null float64
id                       21597 non-null int64
dtypes: category(10), float64(19), int64(1)
memory usage: 3.7 MB
In [136]:
# save final output
df_final_data.to_csv('kc_housing_model_df_final_data.csv')

SUMMARY FIGURES FOR PRESENTATION

In [137]:
# Reset the visual style of the notebook
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams.update(inline_rc)

# inline_rc = dict(mpl.rcParams)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\matplotlib\__init__.py:855: MatplotlibDeprecationWarning: 
examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\matplotlib\__init__.py:846: MatplotlibDeprecationWarning: 
The text.latex.unicode rcparam was deprecated in Matplotlib 2.2 and will be removed in 3.1.
  "2.2", name=key, obj_type="rcparam", addendum=addendum)
In [138]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
plt.style.use('dark_background')
# plt.style.use('dark')


# Define the figure and axes and the # of subplots, sharing the y axes
fig, ax = plt.subplots(figsize=(16,12), ncols=2, nrows=2, sharey=True)

## Defining Formatting to be Used

# Formatting dollar sign labels
fmtPrice = '${x:,.0f}'
tickPrice = mtick.StrMethodFormatter(fmtPrice)

# Axis Label fonts
fontTitle = {'fontsize': 20,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontAxis = {'fontsize': 16,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontTicks = {'fontsize': 12,
           'fontweight':'medium',
            'fontfamily':'serif'}

# The amount of space above titles
y_title_margin = 1.01

# Major title
# plt.suptitle("Critical Factors for Predicting Sales Price", y = 1.0, fontdict=fontTitle, fontsize=22)


## Subplot 1
i,j=0,0
ax[i,j].set_title("Zipcode",y = y_title_margin,fontdict=fontTitle)#, y = y_title_margin)
sns.stripplot(df_final_data['zipcode'],df_final_data['price'],ax=ax[i,j],marker='o',size=3)

# Remove xticks
ax[i,j].set_xticks([]), ax[i,j].set_xlabel('')

# Change y-tick labels
ax[i,j].set_ylabel('Price',fontdict=fontAxis)

yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice)

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')



## Subplot 2
i,j = 0,1
ax[i,j].set_title("Housing Grade",y = y_title_margin,fontdict=fontTitle)
ax[i,j].title.set_fontsize(20)
sns.stripplot(df_final_data['grade'],df_final_data['price'],ax=ax[i,j],marker='o',size=2)

#Set x axis
xticklab=ax[i,j].get_xticklabels()
ax[i,j].set_xticklabels(xticklab,fontdict=fontTicks)
ax[i,j].set_xlabel('Grade')


# Change y-tick labels
ax[i,j].set_ylabel('')# 'Price',fontdict=fontAxis)

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')


yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice) 


## Subplot 3
i,j = 1,0

# Title
ax[i,j].set_title("Living Space (sqft)",y = y_title_margin,fontdict=fontTitle)
ax[i,j].title.set_fontsize(20)

# Define the scatter plot and line graph aesthetics
line_kws={"color":"orange","alpha":0.5,"lw":8,"ls":":"}
scatter_kws={'s': 5, 'alpha': 0.5,'marker':'.','color':'red'}


# Plot seaborn plot 
sns.regplot(df_final_data['sqft_living'], df_final_data['price'],ax=ax[i,j], scatter_kws=scatter_kws, line_kws=line_kws) #,marker='o',size=2) 
# sns.stripplot(df_final_data['sqft_living'], df_final_data['price'],ax=ax[i,j],marker='.') #,marker='o',size=2)

## Change the x-axis 
ax[i,j].set_xlabel('Area (sqft)',fontdict=fontAxis)

# Get ticks, rotate labels, and return
# xticks = ax[i,j].get_xticks()
xticklab=ax[i,j].get_xticklabels()
ax[i,j].set_xticklabels(xticklab,fontdict=fontTicks, rotation=45)

# Change the major units of x-axis
ax[i,j].xaxis.set_major_locator(mtick.MultipleLocator(500))
ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

## Change y-axis
# Change y-tick labels
ax[i,j].set_ylabel('Price',fontdict=fontAxis)

yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice) 

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')



# ## Subplot 4
i,j = 1,1
ax[i,j].set_title("Neighbor's Living Space (sqft)",y = y_title_margin,fontdict=fontTitle)
ax[i,j].title.set_fontsize(20)

# Define the scatter plot and line graph aesthetics
line_kws={"color":"lime","alpha":0.5,"lw":8,"ls":":"}
scatter_kws={'s': 5, 'alpha': 0.5,'marker':'.','color':'blueviolet'}

# Plot seaborn plot 
sns.regplot(df_final_data['sqft_living15'], df_final_data['price'],ax=ax[i,j], scatter_kws=scatter_kws, line_kws=line_kws)

# Change the x-axis labels
ax[i,j].set_xlabel('Area (sqft)',fontdict=fontAxis)

# Get ticks, rotate labels, and return
xticklab=ax[i,j].get_xticklabels()
ax[i,j].set_xticklabels(xticklab,fontdict=fontTicks, rotation=45)

# Change the major units of x-axis
ax[i,j].xaxis.set_major_locator(mtick.MultipleLocator(500))
ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

# Change y-tick labels
ax[i,j].set_ylabel('')#Price',fontdict=fontAxis)

yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice) 

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')

plt.tight_layout()
plt.savefig("summary_figure.png") # save as png
In [ ]:
 
In [139]:
from IPython.display import Image
Image("map_median_price.png")
Out[139]:
In [140]:
Image("map_latlong_price.png")
Out[140]:
In [141]:
plt.savefig("summary_figure.png") # save as png
<Figure size 432x288 with 0 Axes>
In [142]:
df_run_ols.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 5 columns):
price                19053 non-null float64
codezipcode          19053 non-null category
grade                19053 non-null int64
sca_sqft_living      19053 non-null float64
sca_sqft_living15    19053 non-null float64
dtypes: category(1), float64(3), int64(1)
memory usage: 1.4 MB
In [ ]:
df_run_ols.describe()


END OF PROJECT 1 RE-SUBMISSION



import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
plt.style.use('dark_background')

# figSum, ax = plt.subplots(nrows=2,ncols=2,figsize=(16,16),sharey=True)
fig, ax = plt.subplots(figsize=(16,12), ncols=2, nrows=2, sharey=True)


# Formatting dollar sign labels
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)

# Axis Label fonts
fontTitle = {'fontsize': 20,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontAxis = {'fontsize': 16,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontTicks = {'fontsize': 12,
           'fontweight':'medium',
            'fontfamily':'serif'}


# The amount of space above titles
y_title_margin = 1.01

# Major title
# plt.suptitle("Critical Factors for Predicting Sales Price", y = 1.0, fontdict=fontTitle, fontsize=22)

## Subplot 1
i,j=0,0
ax[i,j].set_title( "Zipcode", y = y_title_margin, fontdict = fontTitle)

sns.stripplot(df_run_ols['codezipcode'],df_run_ols['price'],ax=ax[i,j],marker='o',size=3,jitter=True)

# Remove xticks
ax[i,j].set_xticks([]), ax[i][j].set_xlabel('')

# Change y-tick labels
ax[i][j].set_ylabel('Price',fontdict=fontAxis)

yticklab = ax[i][j].get_yticklabels()
ax[i][j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tick)
Final Model (original)
  • For k=10 fold validation, with logZ_price as target variable:
    • mean r2 = 0.596, mean MSE = 0.405
  • Predictors in final model:

    • 'logZ_sqft_living'
    • 'logZ_sqft_living15'
    • 'bedrooms'
    • 'bathrooms'
    • 'floors'
    • grade
    • view
    • sqft_above
    • sqft_basement
    • waterfront
  • My final model indicates that several aspects of the size of the house are important for determining price: (sqft) of the living areas, as well as the presence/size of basements, in addition to the # of floors, bedrooms, and bathrooms. Additionally, more qualitative aspects of the neighborhood/area also play and important role, such as if the house is waterfront property, the aesthetic quality of the view, as well as housing-grade

In [ ]:
 
In [ ]: